Gradium exposes Speech-to-Text over two transports. They share the same models; pick the transport that matches your audio source and latency needs.

WebSocket vs REST

| If your audio is… | Use | Why |
| --- | --- | --- |
| Live (microphone, telephony, agent loop) and you need sub-second turn-taking | WebSocket | Lowest latency. Push audio as it’s captured, get text and VAD signals as they’re produced. React to end-of-turn in real time. |
| A complete file you already have on disk or in memory | REST | One HTTP POST with the audio in the body. No connection to manage; the server streams NDJSON results back as the response body. |
| In hand but you want VAD signals or to react to flush events | WebSocket | Use stt_stream (pull-based) over the WebSocket. Same low latency, no manual connection management. |
| Browser microphone or mobile client audio | Browser WebSockets | Use short-lived tokens instead of exposing an API key. |
| Telephony media streams | Telephony Audio | Use ulaw_8000, alaw_8000, or low-sample-rate PCM directly. |

If you’re transcribing pre-recorded audio and don’t need VAD, REST is the simpler path. Move to WebSocket when you need live audio, turn-taking, or in-stream flush.
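
The guidance above can be condensed into a small decision helper. This is only a sketch: the function name choose_transport and its parameters are hypothetical and are not part of any Gradium SDK.

```python
def choose_transport(live: bool, needs_vad_or_flush: bool = False) -> str:
    """Return "websocket" or "rest" following the guidance above.

    Hypothetical helper for illustration; not part of any Gradium SDK.
    """
    # Live audio needs incremental input and real-time end-of-turn signals.
    if live:
        return "websocket"
    # Pre-recorded audio still needs WebSocket for VAD signals or flush events.
    if needs_vad_or_flush:
        return "websocket"
    # A complete file with no VAD requirement: one HTTP POST is simpler.
    return "rest"
```

For example, choose_transport(live=False) selects REST, while adding needs_vad_or_flush=True moves the same file onto the WebSocket path.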

What both transports share

  • Models: same model_name works on both.
  • Input formats: PCM (multiple sample rates), WAV, Opus, mu-law, A-law.
  • Tunable options: temp, language, padding_bonus, delay_in_frames via json_config. See Transcription Settings.
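
As an illustration, the shared options could be gathered into a single JSON object. The option names come from the list above; every value below is a placeholder chosen for the example, not a documented default. How json_config is attached to a request (and whether it is passed as a string or an object) is an assumption here, so check Transcription Settings for the actual contract.

```python
import json

# Option names are from the list above; every value is a placeholder,
# not a recommended or documented default.
options = {
    "language": "en",
    "temp": 0.0,
    "padding_bonus": 0.0,
    "delay_in_frames": 8,
}

# Serialize once so the same settings can be reused on either transport.
json_config = json.dumps(options, sort_keys=True)
print(json_config)
```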

What’s transport-specific

  • WebSocket-only: semantic VAD step messages every 80 ms, in-stream send_flush() for forced processing, adaptive delay controls (delay_in_frames), setup-message stream controls (send_setup_on_start, wait_for_ready_on_start), browser tokens. See WebSocket Lifecycle.
  • REST-only: send the full audio as the request body; receive NDJSON over a streaming response.
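
Because the REST response is NDJSON, each line of the body is an independent JSON object and can be handled as soon as it arrives. The sketch below covers only the parsing side, with no network call: it reassembles complete lines from arbitrary byte chunks, as a streaming HTTP client would deliver them. The field names in the sample ("type", "text") are invented for the example and are not Gradium's documented schema.

```python
import json
from typing import Any, Dict, Iterable, Iterator


def iter_ndjson(chunks: Iterable[bytes]) -> Iterator[Dict[str, Any]]:
    """Yield one parsed object per NDJSON line from a stream of byte chunks."""
    buffer = b""
    for chunk in chunks:
        buffer += chunk
        # A chunk may end mid-line, so only split off complete lines.
        while b"\n" in buffer:
            line, buffer = buffer.split(b"\n", 1)
            if line.strip():
                yield json.loads(line)
    # Handle a final line that has no trailing newline.
    if buffer.strip():
        yield json.loads(buffer)


# Sample body split at awkward boundaries; field names are hypothetical.
sample = [
    b'{"type": "text", "te',
    b'xt": "hello"}\n{"type": ',
    b'"text", "text": "world"}\n',
]
events = list(iter_ndjson(sample))
print(" ".join(event["text"] for event in events))
```

Buffering on raw bytes and decoding only whole lines keeps the parser correct even when a chunk boundary falls inside a multi-byte character.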

Next steps

  • Use the WebSocket API: streaming audio in, transcripts and VAD signals out, with flush control.
  • Use the REST API: one-shot transcription of a complete audio file.
  • Turn-taking recipe: use semantic VAD and adaptive delay to decide when an agent should answer.
  • Transcription settings: language, temp, padding_bonus, delay_in_frames.
  • Errors: error contracts across REST, WebSocket, and streamed responses.