From 64a7a8a46b724dbb6197835c772e11869a9701c4 Mon Sep 17 00:00:00 2001
From: Woody
Date: Thu, 14 May 2026 20:26:17 +0800
Subject: [PATCH] chore: add pnpm lockfiles, Phase 4 plan, and dev plan status update

---
 .plans/phase4_system_audio_plan.md |  453 +++
 development_plan.md                |   10 +-
 frontend/package-lock.json         |   17 +
 frontend/package.json              |    1 +
 frontend/pnpm-lock.yaml            | 4528 ++++++++++++++++++++++++++++
 package-lock.json                  |   28 +
 package.json                       |    5 +
 7 files changed, 5038 insertions(+), 4 deletions(-)
 create mode 100644 .plans/phase4_system_audio_plan.md
 create mode 100644 frontend/pnpm-lock.yaml
 create mode 100644 package-lock.json
 create mode 100644 package.json

diff --git a/.plans/phase4_system_audio_plan.md b/.plans/phase4_system_audio_plan.md
new file mode 100644
index 0000000..ad31065
--- /dev/null
+++ b/.plans/phase4_system_audio_plan.md
@@ -0,0 +1,453 @@
+# Phase 4: System Audio Capture → ASR → RAG — Implementation Plan
+
+**Created:** 2026-05-09
+**Updated:** 2026-05-09
+**Status:** 📋 Draft (Not Started)
+**Depends on:** Phase 1 (Complete), Phase 2 (Complete), Phase 3 (Complete)
+
+---
+
+## 1. Overview
+
+Phase 4 adds **system audio capture** as a third audio source in the LTTPage, alongside file Upload and YouTube. Instead of playing a video in the browser, the user captures audio output from any application on their computer (browser tab, Spotify, Zoom, system sounds) and pipes it through the existing ASR → RAG pipeline.
+
+**Use cases:**
+- Watching a YouTube video in a regular browser tab (no proxy needed — just share that tab's audio)
+- Listening to a podcast, lecture, or meeting and getting real-time transcript + RAG
+- Transcribing any audio playing on the computer without needing to download files
+
+### How It Works
+
+```
+User clicks "System Audio" → clicks "Start Capture"
+  → Browser shows permission dialog (screen/tab picker)
+  → User selects tab/window/screen (with audio)
+  → getDisplayMedia() returns MediaStream (with audio track)
+  → AudioContext.createMediaStreamSource(stream)
+  → ScriptProcessorNode (Float32 PCM, mono 16kHz)
+  → WebSocket → FastAPI → DashScope realtime ASR
+  → transcript → QueryInput → RAG Pipeline
+```
+
+### Audio Routing (vs Existing Sources)
+
+| Source | Audio Input | SourceNode Type | Start/Stop Trigger |
+|--------|-------------|-----------------|-------------------|
+| Upload | `