Phase 2 Enhancement: Use text field instead of stash

Created: 2026-05-07
Status: Planning
Depends on: Phase 2 (Complete)


1. Discovery

Stash log analysis revealed DashScope partial events contain TWO text fields:

Field | Behavior                                            | Description
stash | Sliding window, ~7-20 chars, replaces on each event | Latest characters recognized (raw ASR output)
text  | Monotonically growing, never shrinks                | Stable cumulative transcription of current utterance

Event sequence example:
  stash="多"               text=""             ← text empty early on
  stash="多谢"              text=""
  stash="多谢主席咁啊"        text=""
  stash="主席咁啊,亦"        text=""             ← stash slides, text still empty
  stash="咁啊,亦都多谢"      text="多谢主席咁啊亦"  ← text starts populating
  stash="都多谢邱主任"        text="多谢主席咁啊亦都多谢邱主任头先..."
  stash="point嘅详细介绍"    text="多谢主席咁啊亦都多谢邱主任头先个诶powerpoint嘅详细介绍..."

text grows monotonically — it's the stable transcription. stash slides as new audio arrives.
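
The monotonic-growth observation can be checked against a captured log with a few lines of Python. This is a minimal sketch, not code from the repository: `is_monotonic_prefix_growth` is a hypothetical helper, and the sample strings are illustrative values rather than real log data.

```python
from typing import Iterable


def is_monotonic_prefix_growth(texts: Iterable[str]) -> bool:
    """Return True if every non-empty text extends the previous non-empty one.

    Hypothetical helper: it encodes the property observed in the stash log,
    namely that `text` only ever appends characters within an utterance.
    """
    previous = ""
    for current in texts:
        if not current:
            # Early partial events carry an empty text; skip them.
            continue
        if not current.startswith(previous):
            return False
        previous = current
    return True


# Illustrative values only (not real log data).
growing = ["", "", "多谢主席", "多谢主席咁啊", "多谢主席咁啊亦都"]
sliding = ["多谢主席", "主席咁啊", "咁啊亦都"]  # behaves like stash, not text

print(is_monotonic_prefix_growth(growing))  # True
print(is_monotonic_prefix_growth(sliding))  # False
```

Running a check like this over the text values of one utterance in the stash log would confirm the property before _merge_stash is deleted.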

2. Why This Is Better

  • No _merge_stash needed: text is already cumulative per utterance
  • No overlap detection: in the logs analyzed, characters in text never change once set
  • No risk of wrong merging: stashes sometimes overlap incorrectly (the sliding window may lose context)
  • Simpler code: less logic, less surface area for bugs

3. Changes Required

3.1 Backend: format_transcription_event (ws_asr.py)

# BEFORE: extract stash field
if event_type == "...transcription.text":
    stash = event.get("stash", "")
    return {"delta": "", "stash": stash, ...}

# AFTER: extract text field
if event_type == "...transcription.text":
    text = event.get("text", "")
    return {"delta": "", "text": text, ...}

3.2 Backend: read_events (ws_asr.py)

# BEFORE: merge stashes
else:
    stash = result.pop("stash", "")
    if stash.strip():
        partial_buffer = _merge_stash(partial_buffer, stash)
    display = build_display_text(accumulated_text, partial_buffer)

# AFTER: use text directly (already cumulative)
else:
    text = result.pop("text", "")
    if text.strip():
        partial_buffer = text       # text already cumulative
    display = build_display_text(accumulated_text, partial_buffer)
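
The AFTER branch reduces to a replace-if-non-empty rule, which can be sketched in isolation. This is an illustration under assumptions: `update_partial_buffer` is a hypothetical name, and the `build_display_text` body is an assumed stand-in (plain concatenation), since the real implementation is not shown in this plan.

```python
def build_display_text(accumulated_text: str, partial_buffer: str) -> str:
    # Assumed stand-in: the real function in ws_asr.py may join differently.
    return accumulated_text + partial_buffer


def update_partial_buffer(partial_buffer: str, text: str) -> str:
    """Return the new partial buffer for one partial event (hypothetical helper)."""
    # `text` is cumulative for the current utterance, so it replaces the
    # buffer outright. Empty or whitespace-only text keeps the old buffer.
    return text if text.strip() else partial_buffer


accumulated_text = "上一句。"  # illustrative completed text
partial_buffer = ""
for text in ["", "多谢", "多谢主席", ""]:
    partial_buffer = update_partial_buffer(partial_buffer, text)
    display = build_display_text(accumulated_text, partial_buffer)

print(display)  # 上一句。多谢主席
```

One consequence worth keeping in mind: while text is still empty, as in the first events of the sequence in section 1, the display shows only the completed text until text starts populating.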

3.3 Backend: Remove _merge_stash function

No longer needed.

3.4 Backend: Tests (test_phase2_ws_protocol.py)

  • Replace TestMergeStash class with TestTextFieldFormatting
  • Update partial event tests to use text field instead of stash
  • Verify monotonic growth (each partial text extends the previous one as a prefix and never shrinks)
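
The replacement test class could look roughly like the sketch below. It is only an outline: `format_partial` is a hypothetical stand-in for the partial-event branch of format_transcription_event, returning just the fields shown in this plan, and the event payloads are illustrative.

```python
import unittest


def format_partial(event: dict) -> dict:
    # Hypothetical stand-in for the partial-event branch of
    # format_transcription_event; only the fields shown in this plan.
    return {"delta": "", "text": event.get("text", "")}


class TestTextFieldFormatting(unittest.TestCase):
    def test_uses_text_not_stash(self):
        result = format_partial({"stash": "咁啊", "text": "多谢主席咁啊"})
        self.assertEqual(result["text"], "多谢主席咁啊")
        self.assertNotIn("stash", result)

    def test_missing_text_defaults_to_empty(self):
        self.assertEqual(format_partial({"stash": "多"})["text"], "")

    def test_text_grows_monotonically(self):
        events = [{"text": "多谢"}, {"text": "多谢主席"}, {"text": "多谢主席咁啊"}]
        texts = [format_partial(e)["text"] for e in events]
        # Each partial text must start with the one before it.
        for earlier, later in zip(texts, texts[1:]):
            self.assertTrue(later.startswith(earlier))
```

In the real test file the stand-in would be replaced by an import of format_transcription_event, with the event type set to the actual partial-event type.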

3.5 Backend: Stash log format

Update log to capture both fields for future debugging:

_stash_logger.info(
    "seq=%d elapsed_ms=%d stash_len=%d text_len=%d stash=%r text=%r ...",
    ...
)

4. What Does NOT Change

  • Frontend (useVideoASR.ts) — already handles full_text correctly, no changes needed
  • Frontend (QueryInput.tsx) — unchanged
  • Pause/stop logic — unchanged
  • Completed event handling — unchanged (completed events already use transcript field)
  • partial_buffer variable — still used, just populated from text instead of merged stashes

5. Files Changed

File                                        | Change
backend/app/routers/ws_asr.py               | Remove _merge_stash(), use text field, update stash logging
backend/app/test/test_phase2_ws_protocol.py | Replace merge tests with text-field tests

6. Acceptance Criteria

  • text field used for partial events instead of stash
  • _merge_stash function removed
  • Text displayed in QueryInput grows monotonically (no jumping/replacing)
  • All 16 ws_protocol tests pass (updated)
  • Text persists on pause (existing behavior, unchanged)
  • Stash log captures both stash and text fields for reference

7. Rollback Risk

Low. Only two files change, both in the backend; the frontend is untouched. If the text field behaves unexpectedly, revert to the _merge_stash approach, which is already committed.