WebSocket vs REST
| If your audio is… | Use | Why |
|---|---|---|
| Live (microphone, telephony, agent loop) and you need sub-second turn-taking | WebSocket | Lowest latency. Push audio as it’s captured, get text and VAD signals as they’re produced. React to end-of-turn in real time. |
| A complete file you already have on disk or in memory | REST | One HTTP POST with the audio in the body. No connection to manage; the server streams NDJSON results back as the response body. |
| In hand but you want VAD signals or to react to flush events | WebSocket | Use stt_stream (pull-based) over the WebSocket. Same low latency, no manual connection management. |
| Browser microphone or mobile client audio | Browser WebSockets | Use short-lived tokens instead of exposing an API key. |
| Telephony media streams | Telephony Audio | Use ulaw_8000, alaw_8000, or low-sample-rate PCM directly. |
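
For the live-audio row above, "push audio as it's captured" usually means slicing the capture buffer into small fixed-duration frames before sending each one. The following is a minimal sketch of that slicing step only. The helper name `pcm_frames`, the 16-bit mono assumption, and the 80 ms default are illustrative choices, not part of the API; nothing in this page requires a particular chunk size.

```python
from typing import Iterator


def pcm_frames(pcm: bytes, sample_rate: int = 16000, frame_ms: int = 80) -> Iterator[bytes]:
    """Yield fixed-duration slices of 16-bit mono PCM.

    Hypothetical helper: the final slice may be shorter than the others.
    """
    bytes_per_sample = 2  # assumption: 16-bit samples, one channel
    frame_bytes = sample_rate * frame_ms // 1000 * bytes_per_sample
    if frame_bytes <= 0:
        raise ValueError("sample_rate and frame_ms must give at least one sample per frame")
    for start in range(0, len(pcm), frame_bytes):
        yield pcm[start:start + frame_bytes]
```

Each yielded slice would then be handed to whatever send call your WebSocket client exposes; that call is deliberately left out here because its exact name and signature are not given on this page.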
What both transports share
- Models: the same model_name works on both.
- Input formats: PCM (multiple sample rates), WAV, Opus, mu-law, A-law.
- Tunable options: temp, language, padding_bonus, delay_in_frames via json_config. See Transcription Settings.
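
As a rough illustration of the tunable options listed above, the sketch below collects them into a single JSON string. The four option names come from this page. Everything else is an assumption: that json_config is a JSON-encoded object, the helper name `build_json_config`, and the example values. Check Transcription Settings for the real types and ranges.

```python
import json
from typing import Optional


def build_json_config(
    language: Optional[str] = None,
    temp: Optional[float] = None,
    padding_bonus: Optional[float] = None,
    delay_in_frames: Optional[int] = None,
) -> str:
    """Serialize only the options that were set (hypothetical helper)."""
    options = {
        "language": language,
        "temp": temp,
        "padding_bonus": padding_bonus,
        "delay_in_frames": delay_in_frames,
    }
    # Drop unset options so that server-side defaults stay in effect.
    return json.dumps({key: value for key, value in options.items() if value is not None})
```

Because both transports accept the same options, a string built this way could in principle be reused unchanged when switching between REST and WebSocket.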
What’s transport-specific
- WebSocket-only: semantic VAD step messages every 80 ms, in-stream send_flush() for forced processing, adaptive delay controls (delay_in_frames), setup-message stream controls (send_setup_on_start, wait_for_ready_on_start), browser tokens. See WebSocket Lifecycle.
- REST-only: send the full audio as the request body; receive NDJSON over a streaming response.
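
On the REST side, the main client-side work is reading NDJSON from a streaming response: one JSON object per line, with line boundaries that need not line up with network chunks. Here is a minimal sketch of that parsing step, independent of any HTTP library. The function name `iter_ndjson` is hypothetical, and the field names in the usage note are made up for illustration; the actual result schema is not described on this page.

```python
import json
from typing import Any, Iterable, Iterator


def iter_ndjson(chunks: Iterable[bytes]) -> Iterator[Any]:
    """Yield one parsed JSON value per newline-delimited line.

    Buffers partial lines so a record split across chunks is parsed once whole.
    """
    buffer = b""
    for chunk in chunks:
        buffer += chunk
        while b"\n" in buffer:
            line, buffer = buffer.split(b"\n", 1)
            if line.strip():  # skip blank keep-alive lines
                yield json.loads(line)
    if buffer.strip():  # final record without a trailing newline
        yield json.loads(buffer)
```

With an HTTP client that exposes the response body as an iterator of byte chunks, you would pass that iterator straight in, for example `for event in iter_ndjson(body_chunks): print(event)`, and handle each event as soon as it arrives rather than waiting for the full response.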
Next steps
Use the WebSocket API
Streaming audio in, transcripts and VAD signals out, with flush control.
Use the REST API
One-shot transcription of a complete audio file.
Turn-taking recipe
Use semantic VAD and adaptive delay to decide when an agent should answer.
Transcription settings
language, temp, padding_bonus, delay_in_frames.
Errors
Error contracts across REST, WebSocket, and streamed responses.