Phase 2 Enhancement: Delta-based transcript via existing WebSocket

Created: 2026-05-07
Status: Complete
Depends on: Phase 2 (Complete)


1. Goal

Replace full-text WebSocket responses with delta-only updates. The backend diffs each partial event's text field against the previous one and sends only the new characters. The frontend appends them to its local transcript. Everything travels over the existing WebSocket.

2. Why text field over stash

Stash log analysis revealed DashScope partial events have TWO text fields:

| Field | Behavior | Description |
|-------|----------|-------------|
| stash | Sliding window, ~7-20 chars | Latest uncommitted characters |
| text | Monotonically growing (within one item_id) | Stable committed transcription |

text is the stable, cumulative transcription. It grows monotonically within each utterance (item_id). Delta = text[len(prev_text):] — simple suffix diff, no merge logic needed.

stash is still useful: it contains trailing chars not yet committed to text. Sent alongside deltas so the frontend can append it on pause (completing the last sentence).
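
The suffix diff described above can be sketched as a small pure function. This is an illustration only: compute_delta is a hypothetical name, not a function from ws_asr.py, and the fallback of sending the whole text when it does not extend the previous value is an assumption taken from the edge-case table in section 9.

```python
def compute_delta(prev_text: str, text: str) -> str:
    """Return the characters of `text` that were not yet in `prev_text`."""
    if prev_text and text.startswith(prev_text):
        # Normal case: text grew monotonically, so the delta is the new suffix.
        return text[len(prev_text):]
    # No previous text, or text does not extend it: send everything.
    return text


# Replay three cumulative `text` values from one utterance.
updates = ["多謝", "多謝主席", "多謝主席咁啊"]
prev = ""
deltas = []
for text in updates:
    deltas.append(compute_delta(prev, text))
    prev = text

# Concatenating the deltas reproduces the latest cumulative text.
rebuilt = "".join(deltas)
```

Because each delta is a plain suffix, the receiver needs no merge logic: string concatenation is enough to rebuild the transcript.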

3. Architecture

DashScope partial event:
  {"text": "多謝主席咁啊", "stash": "咁啊亦都", "item_id": "item_ABC"}

Backend computes:
  delta = text[len(prev_text):]       → "咁啊" (new chars only)
  sends: {"delta":"咁啊", "stash":"咁啊亦都", "is_final":false}

Frontend:
  transcriptRef += msg.delta          → accumulates locally
  lastStashRef = msg.stash            → stores for pause handler

On pause:
  if stash not already at end of text:
    text += stash                     → "...可以诶处理埋诶呢啲余。"
  onFinalTranscript(text)             → persists in QueryInput

4. Key Design Decisions

4.1 Stash-only events skipped (text empty)

Early in each utterance, text is "" while stash has sliding window content. These are skipped — delta can't be computed from a sliding window. Once text starts populating (after ~1-2s), deltas stream.

4.2 Utterance boundary tracking

DashScope splits speech into item_id segments. When item_id changes, text resets to "". On detection:

  • prev_display set to " " (space) so the next utterance's full text is sent as delta, prepended with a space
  • Result: "...上一句 融資安排方面..." — utterances flow continuously
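
A minimal sketch of the boundary handling in 4.1 and 4.2, written as a standalone class so it can be exercised without a WebSocket. DeltaTracker, feed, and pending_space are hypothetical names. Unlike the backend, which encodes the boundary by setting prev_display to " ", this sketch uses an explicit flag for the separator; treat that as one possible design, not the shipped one.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DeltaTracker:
    prev_text: str = ""          # last `text` seen for the current item_id
    current_item_id: str = ""
    pending_space: bool = False  # separator owed to the next non-empty text

    def feed(self, item_id: str, text: str) -> Optional[str]:
        """Return the delta to send, or None if the event should be skipped."""
        if item_id and item_id != self.current_item_id:
            # New utterance: `text` restarts from "", so reset the baseline.
            # A separator is owed only if something was already sent.
            self.pending_space = self.pending_space or bool(self.prev_text)
            self.prev_text = ""
            self.current_item_id = item_id
        if not text.strip():
            return None  # stash-only event (4.1): nothing to diff yet
        if text == self.prev_text:
            return ""    # no new characters
        if self.prev_text and text.startswith(self.prev_text):
            delta = text[len(self.prev_text):]
        else:
            # First text of an utterance: send it whole, space-prefixed if owed.
            delta = (" " if self.pending_space else "") + text
            self.pending_space = False
        self.prev_text = text
        return delta


tracker = DeltaTracker()
events = [
    ("item_A", ""), ("item_A", "上一"), ("item_A", "上一句"),
    ("item_B", ""), ("item_B", "融資"), ("item_B", "融資安排方面"),
]
sent = [tracker.feed(item_id, text) for item_id, text in events]
joined = "".join(d for d in sent if d is not None)
```

Replaying the six events yields two skipped stash-only events and four deltas, and the concatenation reads as one continuous line with a single space at the utterance boundary.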

4.3 Trailing stash on pause

The text field lags behind the audio, so the final characters of each utterance exist only in stash. Every WebSocket message therefore carries the current stash, and the frontend stores it. On pause, the frontend appends the stored stash to the transcript unless the transcript already ends with it.
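
The sample event in section 3 shows a stash ("咁啊亦都") whose first characters are already at the end of text ("...咁啊"). If that kind of partial overlap occurs in practice, which is an assumption based on that one example, a whole-string "ends with" check would duplicate the shared characters. The sketch below shows an overlap-aware append. It is written in Python to illustrate the logic only; append_stash is a hypothetical helper, and the real pause handler lives in the TypeScript hook.

```python
def append_stash(text: str, stash: str) -> str:
    """Append `stash` to `text`, skipping the part `text` already ends with."""
    text, stash = text.strip(), stash.strip()
    if not stash:
        return text
    # Try the longest possible overlap first, then shrink.
    for k in range(min(len(text), len(stash)), 0, -1):
        if text.endswith(stash[:k]):
            return text + stash[k:]
    # No overlap at all: plain append.
    return text + stash


merged = append_stash("多謝主席咁啊", "咁啊亦都")
```

With the section 3 sample, the two shared characters are kept once and only "亦都" is appended. When the stash is already fully present at the end of the transcript, the function returns the transcript unchanged, which matches the existing behaviour.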

5. Backend Implementation (ws_asr.py)

prev_display = ""
current_item_id = ""

async def read_events():
    nonlocal accumulated_text, prev_display, current_item_id
    while True:
        event = await event_queue.get()
        result = format_transcription_event(event, accumulated_text)
        if result is None:
            continue
        if result["is_final"]:
            # Completed utterance: send full accumulated text
            transcript = event.get("transcript", "")
            if transcript and transcript.strip():
                accumulated_text = build_display_text(accumulated_text, transcript)
            prev_display = ""
            result["delta"] = ""
            result["full_text"] = _to_traditional(accumulated_text)
        else:
            text = result.pop("text", "")
            stash = result.pop("stash", "")

            # Utterance boundary: item_id change resets text to empty
            item_id = event.get("item_id", "")
            if item_id and item_id != current_item_id:
                if prev_display:
                    prev_display = " "  # prepend space for next utterance
                current_item_id = item_id

            # text is monotonically growing within one utterance
            if text.strip():
                new_delta = ""
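                if text != prev_display:
                    if prev_display and text.startswith(prev_display):
                        new_delta = text[len(prev_display):]
                    elif prev_display == " ":
                        # First text after an utterance boundary: send it whole,
                        # prefixed with the separator space described in 4.2
                        new_delta = " " + text
                    else:
                        new_delta = text
                    prev_display = text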
                result["delta"] = _to_traditional(new_delta) if new_delta else ""
                result["full_text"] = ""
                result["stash"] = _to_traditional(stash) if stash.strip() else ""
            else:
                # text empty (new utterance starting) — skip
                continue
        await client_ws.send_json(result)

6. Frontend Implementation (useVideoASR.ts)

const transcriptRef = useRef('')
const lastStashRef = useRef('')

ws.onmessage = (e) => {
    const msg = JSON.parse(e.data)
    if (msg.is_final && msg.full_text) {
        transcriptRef.current = msg.full_text
        lastStashRef.current = ''
        setTranscript(msg.full_text)
        setPartialTranscript('')
        onFinalTranscriptRef.current?.(msg.full_text)
    } else if (msg.delta) {
        transcriptRef.current += msg.delta
        lastStashRef.current = msg.stash || ''
        setTranscript(transcriptRef.current)
        setPartialTranscript(transcriptRef.current)
    }
}

const stopStreaming = useCallback(() => {
    closeWebSocket()
    let text = transcriptRef.current.trim()
    const stash = lastStashRef.current.trim()
    if (stash && !text.endsWith(stash)) {
        text += stash
    }
    lastStashRef.current = ''
    if (text) {
        onFinalTranscriptRef.current?.(text)
        setPartialTranscript('')
    }
}, [closeWebSocket])

7. WebSocket Message Format

// Partial event (new chars only, with stash):
{"delta":"咁啊", "stash":"咁啊亦都", "full_text":"", "language":"yue", "is_final":false}

// Completed event (full accumulated text):
{"delta":"", "stash":"", "full_text":"多謝主席咁啊亦都...", "language":"yue", "is_final":true}
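
The two message shapes above can be pinned down with a typed dictionary and two small builders. This is a sketch under assumptions: the Python ASRMessage type mirrors the name of the frontend type but is hypothetical on the backend, and partial_message and final_message are illustrative helpers, not functions from ws_asr.py.

```python
import json
from typing import TypedDict


class ASRMessage(TypedDict):
    delta: str       # new characters only; "" on completed events
    stash: str       # trailing uncommitted characters; "" on completed events
    full_text: str   # "" on partial events
    language: str
    is_final: bool


def partial_message(delta: str, stash: str, language: str = "yue") -> ASRMessage:
    """Build a partial event: delta plus stash, no full text."""
    return {"delta": delta, "stash": stash, "full_text": "",
            "language": language, "is_final": False}


def final_message(full_text: str, language: str = "yue") -> ASRMessage:
    """Build a completed event: full accumulated text only."""
    return {"delta": "", "stash": "", "full_text": full_text,
            "language": language, "is_final": True}


# ensure_ascii=False keeps the Chinese characters readable on the wire.
wire = json.dumps(partial_message("咁啊", "咁啊亦都"), ensure_ascii=False)
decoded = json.loads(wire)
```

Keeping every key present in both shapes, with empty strings for unused fields, lets the client branch on is_final and delta without checking for missing keys.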

8. Files Changed

| File | Change |
|------|--------|
| backend/app/routers/ws_asr.py | Delta computation from text field, item_id tracking, stash passthrough |
| backend/app/test/test_phase2_ws_protocol.py | Updated partial event tests for text/stash fields |
| frontend/src/hooks/useVideoASR.ts | lastStashRef, delta accumulation, stash append on pause |
| frontend/src/types/index.ts | Added stash?: string to ASRMessage |

9. Edge Cases Handled

| Case | Handling |
|------|----------|
| text empty (new utterance) | Skip; a delta cannot be computed from the sliding stash window |
| item_id changes | Reset prev_display, prepend space to next utterance |
| Trailing stash not in text | Frontend appends on pause via lastStashRef |
| text doesn't start with prev_display | Send entire text as delta (utterance boundary) |
| Multiple completed events | Each resets prev_display and lastStashRef |
| Play again after pause | New WS session, all state reset |
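
Several of these cases can be checked by replaying a message sequence through the client-side accumulation rules. The sketch below restates those rules in Python purely for illustration; apply_message is a hypothetical function, and the real logic is the ws.onmessage handler in useVideoASR.ts.

```python
from typing import Tuple


def apply_message(transcript: str, last_stash: str, msg: dict) -> Tuple[str, str]:
    """Return the (transcript, last_stash) state after one server message."""
    if msg.get("is_final") and msg.get("full_text"):
        # Completed event: replace the transcript and clear the stash.
        return msg["full_text"], ""
    if msg.get("delta"):
        # Partial event: append new characters, remember the latest stash.
        return transcript + msg["delta"], msg.get("stash") or ""
    # Nothing new: keep the current state.
    return transcript, last_stash


messages = [
    {"delta": "多謝", "stash": "主席", "full_text": "", "is_final": False},
    {"delta": "主席", "stash": "咁啊", "full_text": "", "is_final": False},
    {"delta": "", "stash": "", "full_text": "多謝主席咁啊", "is_final": True},
    {"delta": " 融資", "stash": "安排", "full_text": "", "is_final": False},
]

state = ("", "")
history = []
for msg in messages:
    state = apply_message(*state, msg)
    history.append(state)
```

The replay shows the completed event overwriting the locally accumulated text and clearing the stored stash, after which deltas from the next utterance continue to append.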

10. Lessons Learned

  • text is the right field: monotonically growing, simple delta = suffix diff
  • stash is a sliding window: can't be accumulated, can't be delta-diffed
  • item_id matters: DashScope splits speech into items; text resets on item change
  • _merge_stash not needed: text is already cumulative; stash only needed for trailing chars
  • replace: true destroys cross-utterance text: simpler to just append with space separator
  • Stash on pause completes sentences: text lags ~200ms behind; stash fills the gap