# Phase 2 Enhancement: Delta-based transcript via existing WebSocket

**Created:** 2026-05-07
**Status:** Complete
**Depends on:** Phase 2 (Complete)

---

## 1. Goal

Replace full-text WebSocket responses with **delta-only updates**. The backend computes the difference between consecutive `text` fields and sends only the new characters. The frontend appends them client-side. Everything runs over the existing WebSocket.

## 2. Why `text` field over `stash`

Stash log analysis revealed that DashScope partial events have two text fields:

| Field | Behavior | Description |
|-------|----------|-------------|
| `stash` | Sliding window, ~7-20 chars | Latest uncommitted characters |
| `text` | **Monotonically growing** (within one `item_id`) | Stable committed transcription |

`text` is the stable, cumulative transcription. It grows monotonically within each utterance (`item_id`), so delta = `text[len(prev_text):]`: a simple suffix diff, no merge logic needed.

`stash` is still useful: it contains trailing chars not yet committed to `text`. It is sent alongside deltas so the frontend can append it on pause (completing the last sentence).

## 3. Architecture

```
DashScope partial event:
  {"text": "多謝主席咁啊", "stash": "咁啊亦都", "item_id": "item_ABC"}

Backend computes:
  delta = text[len(prev_text):]  → "咁啊" (new chars only)
  sends: {"delta":"咁啊", "stash":"咁啊亦都", "is_final":false}

Frontend:
  transcriptRef += msg.delta     → accumulates locally
  lastStashRef = msg.stash       → stores for pause handler

On pause:
  if stash not already at end of text:
    text += stash                → "...可以诶处理埋诶呢啲余。"
  onFinalTranscript(text)        → persists in QueryInput
```

## 4. Key Design Decisions

### 4.1 Stash-only events skipped (text empty)

Early in each utterance, `text` is `""` while `stash` has sliding-window content. These events are skipped, because a delta can't be computed from a sliding window. Once `text` starts populating (after ~1-2s), deltas stream.

### 4.2 Utterance boundary tracking

DashScope splits speech into `item_id` segments. When `item_id` changes, `text` resets to `""`.
On detection:

- `prev_display` is set to `" "` (a single space) so that the next utterance's full text is sent as the delta, prepended with a space
- Result: `"...上一句 融資安排方面..."`, so utterances flow continuously

### 4.3 Trailing stash on pause

The `text` field lags behind: the final characters of each utterance are only in `stash`. Every WS message includes the current `stash`, and the frontend stores it. On pause, the frontend appends the stash to the text unless the text already ends with it.

## 5. Backend Implementation (`ws_asr.py`)

```python
prev_display = ""
current_item_id = ""

async def read_events():
    nonlocal accumulated_text, prev_display, current_item_id
    while True:
        event = await event_queue.get()
        result = format_transcription_event(event, accumulated_text)
        if result is None:
            continue

        if result["is_final"]:
            # Completed utterance: send full accumulated text
            transcript = event.get("transcript", "")
            if transcript and transcript.strip():
                accumulated_text = build_display_text(accumulated_text, transcript)
            prev_display = ""
            result["delta"] = ""
            result["full_text"] = _to_traditional(accumulated_text)
        else:
            text = result.pop("text", "")
            stash = result.pop("stash", "")

            # Utterance boundary: item_id change resets text to empty
            item_id = event.get("item_id", "")
            if item_id and item_id != current_item_id:
                if prev_display:
                    prev_display = " "  # prepend space for next utterance
                current_item_id = item_id

            # text is monotonically growing within one utterance
            if text.strip():
                new_delta = ""
                if text != prev_display:
                    if prev_display and text.startswith(prev_display):
                        new_delta = text[len(prev_display):]
                    else:
                        new_delta = text
                    prev_display = text
                result["delta"] = _to_traditional(new_delta) if new_delta else ""
                result["full_text"] = ""
                result["stash"] = _to_traditional(stash) if stash.strip() else ""
            else:
                # text empty (new utterance starting): skip
                continue

        await client_ws.send_json(result)
```

## 6. Frontend Implementation (`useVideoASR.ts`)

```typescript
const transcriptRef = useRef('')
const lastStashRef = useRef('')

ws.onmessage = (e) => {
  const msg = JSON.parse(e.data)
  if (msg.is_final && msg.full_text) {
    transcriptRef.current = msg.full_text
    lastStashRef.current = ''
    setTranscript(msg.full_text)
    setPartialTranscript('')
    onFinalTranscriptRef.current?.(msg.full_text)
  } else if (msg.delta) {
    transcriptRef.current += msg.delta
    lastStashRef.current = msg.stash || ''
    setTranscript(transcriptRef.current)
    setPartialTranscript(transcriptRef.current)
  }
}

const stopStreaming = useCallback(() => {
  closeWebSocket()
  let text = transcriptRef.current.trim()
  const stash = lastStashRef.current.trim()
  if (stash && !text.endsWith(stash)) {
    text += stash
  }
  lastStashRef.current = ''
  if (text) {
    onFinalTranscriptRef.current?.(text)
    setPartialTranscript('')
  }
}, [closeWebSocket])
```

## 7. WebSocket Message Format

```json
// Partial event (new chars only, with stash):
{"delta":"咁啊", "stash":"咁啊亦都", "full_text":"", "language":"yue", "is_final":false}

// Completed event (full accumulated text):
{"delta":"", "stash":"", "full_text":"多謝主席咁啊亦都...", "language":"yue", "is_final":true}
```

## 8. Files Changed

| File | Change |
|------|--------|
| `backend/app/routers/ws_asr.py` | Delta computation from `text` field, `item_id` tracking, `stash` passthrough |
| `backend/app/test/test_phase2_ws_protocol.py` | Updated partial event tests for `text`/`stash` fields |
| `frontend/src/hooks/useVideoASR.ts` | `lastStashRef`, delta accumulation, stash append on pause |
| `frontend/src/types/index.ts` | Added `stash?: string` to `ASRMessage` |

## 9. Edge Cases Handled

| Case | Handling |
|------|----------|
| `text` empty (new utterance) | Skip: delta not computable from sliding stash |
| `item_id` changes | Reset `prev_display`, prepend space to next utterance |
| Trailing stash not in `text` | Frontend appends on pause via `lastStashRef` |
| `text` doesn't start with `prev_display` | Send entire `text` as delta (utterance boundary) |
| Multiple completed events | Each resets `prev_display` and `lastStash` |
| Play again after pause | New WS session, all state reset |

## 10. Lessons Learned

- **`text` is the right field**: monotonically growing, simple delta = suffix diff
- **`stash` is a sliding window**: can't be accumulated, can't be delta-diffed
- **`item_id` matters**: DashScope splits speech into items; text resets on item change
- **`_merge_stash` not needed**: `text` is already cumulative; stash is only needed for trailing chars
- **`replace: true` destroys cross-utterance text**: simpler to just append with a space separator
- **Stash on pause completes sentences**: `text` lags ~200ms behind; stash fills the gap
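
The suffix-diff and `item_id` rules summarized above can be exercised in isolation, without a WebSocket or DashScope connection. The following is a minimal sketch, not the code in `ws_asr.py`: the class name `DeltaTracker`, its method `partial`, and the `pending_separator` flag are hypothetical names introduced for illustration. It assumes that the space between utterances should be emitted as part of the next non-empty delta, and it tracks that with an explicit flag rather than by overwriting the previous text.

```python
from dataclasses import dataclass


@dataclass
class DeltaTracker:
    """Hypothetical helper: emits only the new characters of a growing `text`."""

    prev_text: str = ""
    current_item_id: str = ""
    pending_separator: bool = False  # assumption: flag instead of prev_display = " "

    def partial(self, text: str, item_id: str) -> str:
        """Return the delta for one partial event, or "" if nothing is new."""
        if item_id and item_id != self.current_item_id:
            # New utterance: `text` restarts, so drop the old prefix and
            # remember to separate the two utterances with a space.
            self.pending_separator = self.pending_separator or bool(self.prev_text)
            self.prev_text = ""
            self.current_item_id = item_id

        if not text.strip():
            return ""  # stash-only event: nothing committed yet

        if text.startswith(self.prev_text):
            delta = text[len(self.prev_text):]  # plain suffix diff
        else:
            delta = text  # text no longer extends the prefix: resend it all
        self.prev_text = text

        if delta and self.pending_separator:
            delta = " " + delta
            self.pending_separator = False
        return delta


if __name__ == "__main__":
    tracker = DeltaTracker()
    events = [
        ("", "item_A"),
        ("多謝", "item_A"),
        ("多謝主席咁啊", "item_A"),
        ("", "item_B"),
        ("融資安排", "item_B"),
    ]
    # Concatenating the deltas reproduces what the frontend accumulates.
    print("".join(tracker.partial(text, item_id) for text, item_id in events))
```

Keeping the tracker free of I/O means the boundary cases from section 9 can be checked with plain assertions before wiring the logic into the event loop.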