# Alibaba Cloud DashScope ASR — Reference Examples
Adapted from `/mnt/c/Users/woody/Documents/projects/voice input/` (Cantonese voice-to-text web app).
## Files
| File | Purpose | Language |
|------|---------|----------|
| `alibaba_asr_backend.py` | FastAPI WebSocket proxy to DashScope realtime ASR | Python |
| `alibaba_asr_frontend_vanilla.html` | Browser audio capture + WebSocket (vanilla JS) | HTML/JS |
| `alibaba_asr_frontend_react.tsx` | React/TS hook + component for audio capture | TypeScript/React |
## Architecture
```
Browser (Float32 PCM, 16kHz mono)
    │  WebSocket: send(float32Data.buffer)
    ▼
FastAPI Backend (/ws/asr/{video_id})
    │  Convert Float32 → S16_LE → base64
    ▼
Alibaba Cloud DashScope (wss://dashscope-intl.aliyuncs.com/api-ws/v1/realtime)
    │  Model: qwen3-asr-flash-realtime
    ▼  Language: yue (Cantonese)
Transcript JSON → Browser
```
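The Float32 → S16_LE → base64 step in the diagram can be sketched with the standard library alone. This is a minimal illustration, not the code in `alibaba_asr_backend.py`: the function name `float32_to_s16le_base64` is hypothetical, and it assumes the browser sends little-endian Float32 samples in the range [-1.0, 1.0], which is what `Float32Array` produces on common platforms.

```python
import base64
import struct


def float32_to_s16le_base64(payload: bytes) -> str:
    """Convert raw little-endian Float32 PCM bytes to base64-encoded S16_LE PCM.

    Hypothetical helper for illustration only.
    """
    # Ignore any trailing bytes that do not form a complete 4-byte sample.
    usable = len(payload) - (len(payload) % 4)
    count = usable // 4
    floats = struct.unpack(f"<{count}f", payload[:usable])

    # Clamp to [-1.0, 1.0] so out-of-range samples cannot overflow int16,
    # then scale to the signed 16-bit range.
    ints = [int(round(max(-1.0, min(1.0, s)) * 32767)) for s in floats]

    pcm16 = struct.pack(f"<{count}h", *ints)
    return base64.b64encode(pcm16).decode("ascii")


if __name__ == "__main__":
    # Four samples: silence, full-scale positive, full-scale negative, clipped.
    frame = struct.pack("<4f", 0.0, 1.0, -1.0, 2.0)
    print(float32_to_s16le_base64(frame))
```

Clamping before scaling is a deliberate choice here: browser audio can briefly exceed ±1.0, and packing an out-of-range integer with `struct` would raise an error.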
## Key Details
- **Audio format**: Float32 PCM, 16 kHz, mono (browser) → S16_LE PCM, 16 kHz, mono, base64-encoded (DashScope)
- **Model**: `qwen3-asr-flash-realtime` (realtime over WebSocket, unlimited duration)
- **Endpoint**: `wss://dashscope-intl.aliyuncs.com/api-ws/v1/realtime`
- **SDK**: `pip install "dashscope>=0.4.0"` (quote the requirement so the shell does not treat `>` as a redirect)
- **Cantonese**: Language code `yue` (supported natively by DashScope)
- **VAD**: Server-side (Alibaba Cloud handles voice activity detection)
- **Pricing**: ~$0.00009/second
- **Features**: Punctuation, inverse text normalization (ITN), filler-word filtering, multi-language auto-detection
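To put the audio format and pricing figures in perspective, here is a rough back-of-the-envelope calculation. It is only a sketch: the rate is the approximate figure quoted above and may not match current pricing, and the function names are illustrative.

```python
RATE_USD_PER_SECOND = 0.00009  # approximate rate from the list above (assumption: unchanged)
SAMPLE_RATE_HZ = 16_000        # 16 kHz mono


def estimate_cost_usd(seconds: float) -> float:
    """Approximate transcription cost for the given audio duration."""
    return seconds * RATE_USD_PER_SECOND


def bytes_per_second(bytes_per_sample: int) -> int:
    """Raw PCM data rate for one second of mono audio."""
    return SAMPLE_RATE_HZ * bytes_per_sample


def base64_size(raw_bytes: int) -> int:
    """Length of the padded base64 encoding of raw_bytes bytes."""
    return 4 * ((raw_bytes + 2) // 3)


if __name__ == "__main__":
    print(f"1 hour of audio: ~${estimate_cost_usd(3600):.3f}")       # ~$0.324
    print(f"Float32 upload:  {bytes_per_second(4)} B/s")             # 64000
    print(f"S16_LE payload:  {bytes_per_second(2)} B/s")             # 32000
    print(f"After base64:    {base64_size(bytes_per_second(2))} B/s")  # 42668
```

Converting to 16-bit samples halves the payload before base64 adds roughly a third back, so the data forwarded to DashScope is still smaller than what the browser uploads.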
## Dependencies
```
# Python
dashscope>=0.4.0
openai>=1.52.0
zhconv>=1.4.0  # Simplified → Traditional Chinese (optional)

# No additional JS deps needed; native Web APIs only:
# WebSocket, AudioContext, ScriptProcessorNode, getUserMedia
```