Phase 2 Enhancement: Delta-based transcript via existing WebSocket
Created: 2026-05-07 Status: Complete Depends on: Phase 2 (Complete)
1. Goal
Replace full-text WebSocket responses with delta-only updates. Backend computes the difference between consecutive text fields and sends only new characters. Frontend appends them client-side. All over the existing WebSocket.
2. Why text field over stash
Stash log analysis revealed DashScope partial events have TWO text fields:
| Field | Behavior | Description |
|---|---|---|
| `stash` | Sliding window, ~7-20 chars | Latest uncommitted characters |
| `text` | Monotonically growing (within one `item_id`) | Stable committed transcription |
text is the stable, cumulative transcription. It grows monotonically within each utterance (item_id). Delta = text[len(prev_text):] — simple suffix diff, no merge logic needed.
stash is still useful: it contains trailing chars not yet committed to text. Sent alongside deltas so the frontend can append it on pause (completing the last sentence).
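The suffix diff described above can be sketched in a few lines. This is a minimal illustration, not the project's code: `compute_delta` is a hypothetical name, and the sample partial texts are made up to match the example in this document.

```python
def compute_delta(prev_text: str, text: str) -> str:
    """Return the characters of `text` that come after `prev_text`."""
    if text.startswith(prev_text):
        return text[len(prev_text):]
    # Not a prefix extension (e.g. a new utterance): fall back to the full text.
    return text


# Cumulative `text` values as they might arrive within one item_id (sample data).
partials = ["多謝", "多謝主席", "多謝主席咁啊"]

prev = ""
deltas = []
for text in partials:
    deltas.append(compute_delta(prev, text))
    prev = text

print(deltas)           # ['多謝', '主席', '咁啊']
print("".join(deltas))  # 多謝主席咁啊
```

Because `text` only grows within an utterance, concatenating the deltas on the client reproduces the latest `text` exactly.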
3. Architecture

    DashScope partial event:
        {"text": "多謝主席咁啊", "stash": "咁啊亦都", "item_id": "item_ABC"}

    Backend computes:
        delta = text[len(prev_text):]   → "咁啊" (new chars only)
        sends: {"delta":"咁啊", "stash":"咁啊亦都", "is_final":false}

    Frontend:
        transcriptRef += msg.delta      → accumulates locally
        lastStashRef = msg.stash        → stores for pause handler

    On pause:
        if stash not already at end of text:
            text += stash               → "...可以诶处理埋诶呢啲余。"
        onFinalTranscript(text)         → persists in QueryInput

4. Key Design Decisions
4.1 Stash-only events skipped (text empty)
Early in each utterance, text is "" while stash has sliding window content. These are skipped — delta can't be computed from a sliding window. Once text starts populating (after ~1-2s), deltas stream.
4.2 Utterance boundary tracking
DashScope splits speech into item_id segments. When item_id changes, text resets to "". On detection:
- `prev_display` is set to `" "` (space) so the next utterance's full text is sent as delta, prepended with a space
- Result: `"...上一句 融資安排方面..."`, so utterances flow continuously
4.3 Trailing stash on pause
The text field lags behind: the final chars of each utterance exist only in stash. Every WS message therefore includes the current stash, and the frontend stores the latest one. On pause, the frontend appends the stored stash to the text unless the text already ends with it.
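If the stash can partially overlap the end of the text, as in the sample event in section 3 (`text` ends in "咁啊" and `stash` starts with it), a plain suffix check would append the shared characters twice. The following is a hedged sketch of an overlap-aware append, written in Python for consistency with the backend examples; `append_stash` is a hypothetical helper, not part of the codebase.

```python
def append_stash(text: str, stash: str) -> str:
    """Append `stash` to `text`, skipping any leading part of `stash` that already ends `text`."""
    if not stash:
        return text
    # Try the longest possible overlap first, then shorter ones.
    for size in range(min(len(text), len(stash)), 0, -1):
        if text.endswith(stash[:size]):
            return text + stash[size:]
    # No overlap at all: append the whole stash.
    return text + stash


print(append_stash("多謝主席咁啊", "咁啊亦都"))  # 多謝主席咁啊亦都
print(append_stash("多謝主席", "咁啊"))          # 多謝主席咁啊
```

One trade-off to be aware of: if a character genuinely repeats across the boundary (the text ends with the same character the uncommitted speech starts with), this heuristic treats it as overlap and drops one copy.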
5. Backend Implementation (ws_asr.py)

    # State local to the enclosing WebSocket handler (hence `nonlocal` below)
    prev_display = ""
    current_item_id = ""

    async def read_events():
        nonlocal accumulated_text, prev_display, current_item_id
        while True:
            event = await event_queue.get()
            result = format_transcription_event(event, accumulated_text)
            if result is None:
                continue
            if result["is_final"]:
                # Completed utterance: send full accumulated text
                transcript = event.get("transcript", "")
                if transcript and transcript.strip():
                    accumulated_text = build_display_text(accumulated_text, transcript)
                prev_display = ""
                result["delta"] = ""
                result["full_text"] = _to_traditional(accumulated_text)
            else:
                text = result.pop("text", "")
                stash = result.pop("stash", "")
                # Utterance boundary: item_id change resets text to empty
                item_id = event.get("item_id", "")
                if item_id and item_id != current_item_id:
                    if prev_display:
                        prev_display = " "  # marker: prepend space to next utterance
                    current_item_id = item_id
                # text is monotonically growing within one utterance
                if text.strip():
                    new_delta = ""
                    if text != prev_display:
                        if prev_display.strip() and text.startswith(prev_display):
                            # Same utterance: send only the newly appended suffix
                            new_delta = text[len(prev_display):]
                        elif prev_display == " ":
                            # First text of a new utterance: separate it with a space
                            new_delta = " " + text
                        else:
                            new_delta = text
                        prev_display = text
                    result["delta"] = _to_traditional(new_delta) if new_delta else ""
                    result["full_text"] = ""
                    result["stash"] = _to_traditional(stash) if stash.strip() else ""
                else:
                    # text empty (new utterance starting): skip
                    continue
            await client_ws.send_json(result)

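The delta rules from sections 4.1 and 4.2 can also be isolated from the WebSocket loop so they are easy to unit-test. The sketch below is an assumption-laden illustration, not the code in `ws_asr.py`: `DeltaTracker` and `feed` are hypothetical names, events are assumed to be plain dicts with `text` and `item_id` keys, and both `_to_traditional` conversion and completed-event handling are left out.

```python
from typing import Optional


class DeltaTracker:
    """Hypothetical helper: converts cumulative partial `text` values into deltas."""

    def __init__(self) -> None:
        self.prev_text = ""         # last text seen for the current item_id ("" at its start)
        self.current_item_id = ""
        self.sent_any = False       # True once at least one delta has been emitted
        self.pending_space = False  # True when the next delta opens a new utterance

    def feed(self, event: dict) -> Optional[str]:
        """Return the delta for one partial event, or None if it should be skipped."""
        item_id = event.get("item_id", "")
        text = event.get("text", "")

        if item_id and item_id != self.current_item_id:
            # Utterance boundary: text restarts from "", so forget the old prefix
            # and remember to separate the utterances with a space.
            self.current_item_id = item_id
            self.prev_text = ""
            self.pending_space = self.sent_any

        if not text.strip():
            return None  # stash-only event: nothing to diff against

        if text.startswith(self.prev_text):
            delta = text[len(self.prev_text):]
        else:
            delta = text  # not a prefix extension: send the whole text
        self.prev_text = text

        if delta:
            if self.pending_space:
                delta = " " + delta
                self.pending_space = False
            self.sent_any = True
        return delta


# Sample events (made-up data): two utterances, each starting with a stash-only event.
events = [
    {"item_id": "item_A", "text": "", "stash": "上一"},
    {"item_id": "item_A", "text": "上一句", "stash": ""},
    {"item_id": "item_B", "text": "", "stash": "融資"},
    {"item_id": "item_B", "text": "融資", "stash": "安排"},
    {"item_id": "item_B", "text": "融資安排方面", "stash": ""},
]

tracker = DeltaTracker()
outputs = [tracker.feed(e) for e in events]
print(outputs)  # [None, '上一句', None, ' 融資', '安排方面']
print("".join(d for d in outputs if d))  # 上一句 融資安排方面
```

Keeping the boundary marker as a separate flag, rather than encoding it in the previous text, avoids mixing the separator with the prefix comparison.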
6. Frontend Implementation (useVideoASR.ts)

    const transcriptRef = useRef('')
    const lastStashRef = useRef('')

    ws.onmessage = (e) => {
      const msg = JSON.parse(e.data)
      if (msg.is_final && msg.full_text) {
        transcriptRef.current = msg.full_text
        lastStashRef.current = ''
        setTranscript(msg.full_text)
        setPartialTranscript('')
        onFinalTranscriptRef.current?.(msg.full_text)
      } else if (msg.delta) {
        transcriptRef.current += msg.delta
        lastStashRef.current = msg.stash || ''
        setTranscript(transcriptRef.current)
        setPartialTranscript(transcriptRef.current)
      }
    }

    const stopStreaming = useCallback(() => {
      closeWebSocket()
      let text = transcriptRef.current.trim()
      const stash = lastStashRef.current.trim()
      if (stash && !text.endsWith(stash)) {
        text += stash
      }
      lastStashRef.current = ''
      if (text) {
        onFinalTranscriptRef.current?.(text)
        setPartialTranscript('')
      }
    }, [closeWebSocket])

7. WebSocket Message Format
// Partial event (new chars only, with stash):
{"delta":"咁啊", "stash":"咁啊亦都", "full_text":"", "language":"yue", "is_final":false}
// Completed event (full accumulated text):
{"delta":"", "stash":"", "full_text":"多謝主席咁啊亦都...", "language":"yue", "is_final":true}
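To make the two message shapes concrete, here is a small sketch of how a client could fold a stream of such messages into a transcript. It mirrors the branching of the frontend hook but is written in Python for consistency with the other examples; `apply_message` is a hypothetical name and the message contents are made-up sample data.

```python
import json
from typing import Tuple


def apply_message(transcript: str, last_stash: str, msg: dict) -> Tuple[str, str]:
    """Return the updated (transcript, last_stash) after one WebSocket message."""
    if msg.get("is_final") and msg.get("full_text"):
        # Completed event: the server's full text replaces the local transcript.
        return msg["full_text"], ""
    if msg.get("delta"):
        # Partial event: append the new characters and keep the latest stash.
        return transcript + msg["delta"], msg.get("stash", "")
    # Anything else leaves the state unchanged.
    return transcript, last_stash


raw_messages = [
    '{"delta":"多謝主席", "stash":"主席咁啊", "full_text":"", "language":"yue", "is_final":false}',
    '{"delta":"咁啊", "stash":"咁啊亦都", "full_text":"", "language":"yue", "is_final":false}',
    '{"delta":"", "stash":"", "full_text":"多謝主席咁啊亦都", "language":"yue", "is_final":true}',
]

transcript, last_stash = "", ""
history = []
for raw in raw_messages:
    transcript, last_stash = apply_message(transcript, last_stash, json.loads(raw))
    history.append((transcript, last_stash))

print(history[1])  # ('多謝主席咁啊', '咁啊亦都')
print(history[2])  # ('多謝主席咁啊亦都', '')
```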
8. Files Changed
| File | Change |
|---|---|
| `backend/app/routers/ws_asr.py` | Delta computation from `text` field, `item_id` tracking, stash passthrough |
| `backend/app/test/test_phase2_ws_protocol.py` | Updated partial event tests for `text`/`stash` fields |
| `frontend/src/hooks/useVideoASR.ts` | `lastStashRef`, delta accumulation, stash append on pause |
| `frontend/src/types/index.ts` | Added `stash?: string` to `ASRMessage` |
9. Edge Cases Handled
| Case | Handling |
|---|---|
| `text` empty (new utterance) | Skip: delta not computable from sliding stash |
| `item_id` changes | Reset `prev_display`, prepend space to next utterance |
| Trailing stash not in `text` | Frontend appends on pause via `lastStashRef` |
| `text` doesn't start with `prev_display` | Send entire `text` as delta (utterance boundary) |
| Multiple completed events | Each resets `prev_display` (backend) and `lastStashRef` (frontend) |
| Play again after pause | New WS session, all state reset |
10. Lessons Learned
- `text` is the right field: monotonically growing, simple delta = suffix diff
- `stash` is a sliding window: can't be accumulated, can't be delta-diffed
- `item_id` matters: DashScope splits speech into items; text resets on item change
- `_merge_stash` not needed: `text` is already cumulative; stash only needed for trailing chars
- `replace: true` destroys cross-utterance text: simpler to just append with space separator
- Stash on pause completes sentences: `text` lags ~200ms behind; stash fills the gap