Phase 2 Enhancement: Delta-based transcript via existing WebSocket

Created: 2026-05-07
Status: Complete
Depends on: Phase 2 (Complete)

1. Goal

Replace full-text WebSocket responses with delta-only updates. The backend computes the difference between consecutive values of the `text` field and sends only the new characters; the frontend appends them client-side. Everything runs over the existing WebSocket.

2. Why `text` field over `stash`

Stash log analysis revealed DashScope partial events have TWO text fields:

| Field | Behavior | Description |
|---|---|---|
| `stash` | Sliding window, ~7-20 chars | Latest uncommitted characters |
| `text` | Monotonically growing (within one `item_id`) | Stable committed transcription |

`text` is the stable, cumulative transcription. It grows monotonically within each utterance (`item_id`), so the delta is `text[len(prev_text):]`: a simple suffix diff with no merge logic.
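The suffix diff can be sketched in a few lines. This is a minimal illustration, not the code in `ws_asr.py`; the function name `compute_delta` is hypothetical, and the fallback for a non-matching prefix is an assumption about how a reset should be treated.

```python
def compute_delta(prev_text: str, text: str) -> str:
    """Return the characters of `text` that come after `prev_text`.

    Assumes `text` only grows within one item_id. If the prefix does not
    match (for example, a new utterance started), the whole text is new.
    """
    if text.startswith(prev_text):
        return text[len(prev_text):]
    return text


# Sample values taken from the Architecture example in this document.
print(compute_delta("多謝主席", "多謝主席咁啊"))  # 咁啊
```

Because an empty `prev_text` is a prefix of every string, the first event of an utterance needs no special case: the full text comes back as the delta.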

`stash` is still useful: it holds the trailing characters not yet committed to `text`. It is sent alongside each delta so that the frontend can append it on pause and complete the last sentence.

3. Architecture

DashScope partial event:
  {"text": "多謝主席咁啊", "stash": "咁啊亦都", "item_id": "item_ABC"}

Backend computes:
  delta = text[len(prev_text):]       → "咁啊" (new chars only)
  sends: {"delta":"咁啊", "stash":"咁啊亦都", "is_final":false}

Frontend:
  transcriptRef += msg.delta          → accumulates locally
  lastStashRef = msg.stash            → stores for pause handler

On pause:
  if stash not already at end of text:
    text += stash                     → "...可以诶处理埋诶呢啲余。"
  onFinalTranscript(text)             → persists in QueryInput

4. Key Design Decisions

4.1 Stash-only events skipped (text empty)

Early in each utterance, `text` is "" while `stash` already holds sliding-window content. These events are skipped because a delta cannot be computed from a sliding window. Once `text` starts populating (after ~1-2s), deltas begin to stream.

4.2 Utterance boundary tracking

DashScope splits speech into `item_id` segments. When `item_id` changes, `text` resets to "". On detection, `prev_display` is set to " " (a single space) so that the next utterance's full text is sent as the delta, prepended with a space.

Result: "...上一句融資安排方面...", with utterances flowing continuously.
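The boundary handling, together with the stash-only skip from 4.1, can be sketched as a small state holder. This is an illustrative sketch only: `DeltaTracker` and its fields are hypothetical names, and it keeps the separator in an explicit `pending_sep` field rather than encoding it in `prev_display`, which is a choice made here for readability and not necessarily how `ws_asr.py` does it.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DeltaTracker:
    """Hypothetical per-connection state for delta computation."""
    prev_text: str = ""    # last `text` seen for the current item_id
    item_id: str = ""      # item_id of the utterance in progress
    pending_sep: str = ""  # separator owed to the next non-empty delta

    def update(self, item_id: str, text: str) -> Optional[str]:
        """Return the delta to send, or None if the event is skipped."""
        if item_id and item_id != self.item_id:
            # New utterance: `text` restarts from "", so drop the old prefix.
            # A separator is owed only if the previous utterance produced text.
            if self.prev_text:
                self.pending_sep = " "
            self.prev_text = ""
            self.item_id = item_id
        if not text.strip():
            return None  # stash-only event: nothing to diff yet
        if text.startswith(self.prev_text):
            delta = text[len(self.prev_text):]
        else:
            delta = text
        self.prev_text = text
        if delta and self.pending_sep:
            delta = self.pending_sep + delta
            self.pending_sep = ""
        return delta


# Made-up event sequence: (item_id, text) pairs across two utterances.
tracker = DeltaTracker()
events = [("item_A", ""), ("item_A", "多謝"), ("item_A", "多謝主席"),
          ("item_B", ""), ("item_B", "融資")]
print([tracker.update(i, t) for i, t in events])
# [None, '多謝', '主席', None, ' 融資']
```

Holding the separator until the first non-empty delta means it survives the stash-only events at the start of the new utterance.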

4.3 Trailing stash on pause

The `text` field lags behind: the final characters of each utterance exist only in `stash`. Every WS message therefore includes the current stash, which the frontend stores. On pause, the frontend appends the stash to the text unless the text already ends with it.
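In the Architecture example, `stash` ("咁啊亦都") shares its first characters with the tail of `text` ("多謝主席咁啊"), so a whole-string suffix check would not detect the overlap. One possible refinement, sketched here under the assumption that such partial overlap can occur at pause time, is to drop the longest prefix of `stash` that `text` already ends with. The function name `append_stash` is hypothetical.

```python
def append_stash(text: str, stash: str) -> str:
    """Append `stash` to `text`, skipping any leading part of `stash`
    that `text` already ends with."""
    if not stash:
        return text
    # Try the longest possible overlap first, then shrink.
    for k in range(min(len(text), len(stash)), 0, -1):
        if text.endswith(stash[:k]):
            return text + stash[k:]
    return text + stash  # no overlap at all


print(append_stash("多謝主席咁啊", "咁啊亦都"))  # 多謝主席咁啊亦都
```

A caveat of this approach: a character that is genuinely repeated across the boundary (text ending in "啊", stash starting with "啊") is treated as overlap and collapsed, so it trades occasional dropped characters for fewer duplicated ones.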

5. Backend Implementation (`ws_asr.py`)

prev_display = ""
current_item_id = ""

async def read_events():
    nonlocal accumulated_text, prev_display, current_item_id
    while True:
        event = await event_queue.get()
        result = format_transcription_event(event, accumulated_text)
        if result is None:
            continue
        if result["is_final"]:
            # Completed utterance: send full accumulated text
            transcript = event.get("transcript", "")
            if transcript and transcript.strip():
                accumulated_text = build_display_text(accumulated_text, transcript)
            prev_display = ""
            result["delta"] = ""
            result["full_text"] = _to_traditional(accumulated_text)
        else:
            text = result.pop("text", "")
            stash = result.pop("stash", "")

            # Utterance boundary: item_id change resets text to empty
            item_id = event.get("item_id", "")
            if item_id and item_id != current_item_id:
                if prev_display:
                    prev_display = " "  # prepend space for next utterance
                current_item_id = item_id

            # text is monotonically growing within one utterance
            if text.strip():
                new_delta = ""
                if text != prev_display:
                    if prev_display and text.startswith(prev_display):
                        new_delta = text[len(prev_display):]
                    else:
                        new_delta = text
                    prev_display = text
                result["delta"] = _to_traditional(new_delta) if new_delta else ""
                result["full_text"] = ""
                result["stash"] = _to_traditional(stash) if stash.strip() else ""
            else:
                # text empty (new utterance starting) — skip
                continue
        await client_ws.send_json(result)

6. Frontend Implementation (`useVideoASR.ts`)

const transcriptRef = useRef('')
const lastStashRef = useRef('')

ws.onmessage = (e) => {
    const msg = JSON.parse(e.data)
    if (msg.is_final && msg.full_text) {
        transcriptRef.current = msg.full_text
        lastStashRef.current = ''
        setTranscript(msg.full_text)
        setPartialTranscript('')
        onFinalTranscriptRef.current?.(msg.full_text)
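    } else if (!msg.is_final) {
        // Store the stash on every partial message, even when delta is empty,
        // so the pause handler never appends a stale stash.
        lastStashRef.current = msg.stash || ''
        if (msg.delta) {
            transcriptRef.current += msg.delta
            setTranscript(transcriptRef.current)
            setPartialTranscript(transcriptRef.current)
        }
    }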
}

const stopStreaming = useCallback(() => {
    closeWebSocket()
    let text = transcriptRef.current.trim()
    const stash = lastStashRef.current.trim()
    if (stash && !text.endsWith(stash)) {
        text += stash
    }
    lastStashRef.current = ''
    if (text) {
        onFinalTranscriptRef.current?.(text)
        setPartialTranscript('')
    }
}, [closeWebSocket])

7. WebSocket Message Format

// Partial event (new chars only, with stash):
{"delta":"咁啊", "stash":"咁啊亦都", "full_text":"", "language":"yue", "is_final":false}

// Completed event (full accumulated text):
{"delta":"", "stash":"", "full_text":"多謝主席咁啊亦都...", "language":"yue", "is_final":true}
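For protocol tests, it can help to replay these messages through a small reference consumer. The sketch below is an assumption-laden illustration, not part of the codebase: `apply_message` is a hypothetical name, the first sample payload is made up, and the behavior follows the description in 4.3 (the stash is stored from every partial message).

```python
import json
from typing import Tuple


def apply_message(transcript: str, last_stash: str, raw: str) -> Tuple[str, str]:
    """Fold one WebSocket message into (transcript, last_stash)."""
    msg = json.loads(raw)
    if msg.get("is_final"):
        # Completed event: full_text replaces the local transcript
        # (kept as-is if full_text is empty) and the stash is cleared.
        return msg.get("full_text") or transcript, ""
    # Partial event: append the delta and remember the latest stash.
    return transcript + msg.get("delta", ""), msg.get("stash", "")


messages = [
    '{"delta":"多謝主席", "stash":"咁啊", "full_text":"", "language":"yue", "is_final":false}',
    '{"delta":"咁啊", "stash":"咁啊亦都", "full_text":"", "language":"yue", "is_final":false}',
]
transcript, stash = "", ""
for raw in messages:
    transcript, stash = apply_message(transcript, stash, raw)
print(transcript)  # 多謝主席咁啊
print(stash)       # 咁啊亦都
```

Keeping the consumer a pure function of `(transcript, last_stash, raw)` makes it easy to assert on intermediate states after each message.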

8. Files Changed

| File | Change |
|---|---|
| `backend/app/routers/ws_asr.py` | Delta computation from `text` field, `item_id` tracking, `stash` passthrough |
| `backend/app/test/test_phase2_ws_protocol.py` | Updated partial event tests for `text`/`stash` fields |
| `frontend/src/hooks/useVideoASR.ts` | `lastStashRef`, delta accumulation, stash append on pause |
| `frontend/src/types/index.ts` | Added `stash?: string` to `ASRMessage` |

9. Edge Cases Handled

| Case | Handling |
|---|---|
| `text` empty (new utterance) | Skip: a delta is not computable from the sliding `stash` |
| `item_id` changes | Reset `prev_display`, prepend space to next utterance |
| Trailing stash not in `text` | Frontend appends on pause via `lastStashRef` |
| `text` doesn't start with `prev_display` | Send entire `text` as delta (utterance boundary) |
| Multiple completed events | Each resets `prev_display` and `lastStashRef` |
| Play again after pause | New WS session, all state reset |

10. Lessons Learned

`text` is the right field: monotonically growing, so the delta is a simple suffix diff
`stash` is a sliding window: it can't be accumulated or delta-diffed
`item_id` matters: DashScope splits speech into items, and `text` resets on item change
`_merge_stash` is not needed: `text` is already cumulative; `stash` is only needed for trailing chars
`replace: true` destroys cross-utterance text: simpler to just append with a space separator
Stash on pause completes sentences: `text` lags ~200ms behind, and `stash` fills the gap
