flush for low-latency turn-taking. This guide covers streaming via
the Python SDK.
One-shot from a finished file?
For pre-recorded audio, the REST POST endpoint is a single
HTTP call; see the REST guide.
Quickstart
The shortest path: open an stt_realtime context manager, push audio
chunks, and iterate to receive transcripts.
Choosing the SDK API
The Python SDK exposes two ways to stream STT over the WebSocket endpoint. Both talk to wss://api.gradium.ai/api/speech/asr; choose
based on how your audio source is shaped.
| Use case | Method |
|---|---|
| You can iterate the audio end-to-end (file, in-memory buffer, finite generator) and only need the transcript | stt_stream |
| You’re producing audio live (microphone, telephony, agent loop) and want to react to VAD or flush mid-stream | stt_realtime |
stt_stream is a single async call that consumes your audio iterable
and returns an STTStream whose iter_text() yields
TextWithTimestamps segments. stt_realtime is an async context
manager. You push chunks with send_audio() and iterate the same
object to receive every message type: text, VAD step, end_text, and
flushed.
Setup is a dict for stt_stream and keyword arguments for
stt_realtime. The wire-level setup message they produce is the
same.
Pull-based: stt_stream
Use stt_stream when the audio is already in hand, for example when
transcribing a file or a complete in-memory buffer. The function
consumes your audio iterable and yields finalised text segments.
iter_text() filters to text segments only and yields
TextWithTimestamps objects (text, start_s, stop_s). To observe
VAD or flushed messages, use stt_realtime below.
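One way to produce the audio iterable that stt_stream consumes is an async generator that reads a file in fixed-size chunks. The following is a minimal sketch, not the SDK's own helper: the name pcm_chunks is hypothetical, the file is assumed to hold raw 24 kHz 16-bit mono PCM, and it assumes an async generator is an acceptable iterable (check the SDK reference for the exact type expected).

```python
import asyncio
from typing import AsyncIterator

# 1920 samples x 2 bytes per 16-bit sample = one 80 ms chunk at 24 kHz.
CHUNK_BYTES = 1920 * 2


async def pcm_chunks(path: str, chunk_bytes: int = CHUNK_BYTES) -> AsyncIterator[bytes]:
    """Yield the file's bytes in order, chunk_bytes at a time.

    The final chunk may be shorter than chunk_bytes.
    (Hypothetical helper, not part of the gradium SDK.)
    """
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_bytes)
            if not chunk:
                break
            yield chunk
            # Let other tasks run between blocking reads.
            await asyncio.sleep(0)
```

The generator keeps only one chunk in memory at a time, so it works the same for short clips and long recordings.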
Real-time push: stt_realtime
Use stt_realtime when audio is being produced live (microphone,
telephony, conversational agent) and you need to react to VAD or flush
events as they arrive. It’s an async context manager: push audio with
send_audio() and iterate the same object to receive every message
type.
Setup is passed as keyword arguments (model_name, input_format,
json_config, …). See Setup Parameters below.
Setup Parameters
- model_name: The STT model to use (default: "default").
- input_format: Audio format of the input. Supported values:
  - "pcm": raw PCM, 24 kHz, 16-bit signed little-endian, mono. Send 1920-sample (80 ms) chunks for best results.
  - "pcm_8000", "pcm_16000", "pcm_22050", "pcm_24000", "pcm_44100", "pcm_48000": raw PCM at the indicated sample rate (16-bit signed little-endian, mono).
  - "wav": a valid WAV file with PCM data (AudioFormat = 1 in the WAV header). 16-, 24-, and 32-bit samples are supported.
  - "opus": Ogg-wrapped Opus stream.
  - "ulaw_8000" (alias "mulaw_8000"): mu-law encoded PCM at 8 kHz.
  - "alaw_8000": A-law encoded PCM at 8 kHz.
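The 1920-sample recommendation for "pcm" corresponds to 80 ms of audio at 24 kHz. The sketch below works out the equivalent chunk size in bytes for the other raw formats. Applying the same 80 ms chunk duration to formats other than "pcm" is an assumption made here for illustration, and the names RAW_FORMATS and chunk_bytes are hypothetical. Container formats ("wav", "opus") are left out because their byte size per 80 ms is not fixed.

```python
# (sample rate in Hz, bytes per sample) for the raw input formats.
# 16-bit PCM uses 2 bytes per sample; mu-law and A-law use 1 byte per sample.
RAW_FORMATS = {
    "pcm": (24000, 2),
    "pcm_8000": (8000, 2),
    "pcm_16000": (16000, 2),
    "pcm_22050": (22050, 2),
    "pcm_24000": (24000, 2),
    "pcm_44100": (44100, 2),
    "pcm_48000": (48000, 2),
    "ulaw_8000": (8000, 1),
    "mulaw_8000": (8000, 1),
    "alaw_8000": (8000, 1),
}

FRAME_MS = 80  # chunk duration assumed for every raw format


def chunk_bytes(input_format: str, frame_ms: int = FRAME_MS) -> int:
    """Return the number of bytes in one frame_ms chunk of mono audio."""
    if input_format not in RAW_FORMATS:
        raise ValueError(f"no fixed chunk size for format {input_format!r}")
    rate, width = RAW_FORMATS[input_format]
    samples = rate * frame_ms // 1000
    return samples * width


print(chunk_bytes("pcm"))  # 3840 bytes = 1920 samples x 2 bytes
```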
Message Types
When iterating over an stt_realtime session you receive every
message the server sends. The relevant types for transcription:
- text: a transcribed text segment with a start_s timestamp.
- end_text: finalises the previous text segment with a stop_s timestamp. See Timestamps.
- step: Voice Activity Detection information emitted every 80 ms. See Voice Activity Detection (VAD).
- flushed: confirmation that all audio submitted before a send_flush() has been processed. See Flushing.
- end_of_stream: terminal message after send_eos().
stt_stream exposes only finalised text segments via iter_text(). To
observe step, end_text, or flushed messages, use stt_realtime.
Timestamps
Each text message carries a start_s field, the time in seconds
within the input audio at which the transcribed segment begins. When
the segment is finalised, the server emits an end_text message with
the corresponding stop_s.
Pair text and end_text by stream_id to recover (text, start_s, stop_s) triples per segment.
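The pairing of text and end_text messages can be sketched as a small generator. This is an illustration under stated assumptions, not the SDK's implementation: messages are assumed to arrive as dicts with a "type" key plus "stream_id", "text", "start_s", and "stop_s" fields, and the names Segment and pair_segments are hypothetical.

```python
from typing import Iterable, Iterator, NamedTuple


class Segment(NamedTuple):
    text: str
    start_s: float
    stop_s: float


def pair_segments(messages: Iterable[dict]) -> Iterator[Segment]:
    """Join each text message with the end_text that finalises it.

    Assumed message shape (hypothetical): dicts keyed by "type".
    Other message types (step, flushed, ...) are ignored.
    """
    open_segments: dict = {}  # stream_id -> (text, start_s)
    for msg in messages:
        kind = msg.get("type")
        if kind == "text":
            open_segments[msg["stream_id"]] = (msg["text"], msg["start_s"])
        elif kind == "end_text":
            pending = open_segments.pop(msg["stream_id"], None)
            if pending is not None:  # skip end_text with no matching text
                text, start_s = pending
                yield Segment(text, start_s, msg["stop_s"])
```

Because each end_text removes the entry it closes, the same logic works whether stream_id is unique per segment or reused across consecutive segments.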
Timestamps are emitted at segment granularity. For finer-grained
word or character offsets, derive them from the segment text.
Voice Activity Detection (VAD)
Gradium’s STT WebSocket emits a step message every 80 ms containing
semantic VAD predictions across multiple horizons: for each
horizon (in seconds) the model reports the probability that the
speaker will be silent by that point in the future. This is more
informative than a simple voiced/unvoiced flag and is designed for
turn-taking in conversational systems.
Detecting end of turn
The recommended starting point is to look at the 2 s horizon and trigger when the inactivity probability crosses 0.5.
Adaptive Delay
Gradium STT is intentionally not a fixed-latency black box. The delay_in_frames option controls how much audio context the model
uses before emitting text. Each frame is 80 ms, and the server returns
the active value in the ready.delay_in_frames field.
Higher values give the model more context and can improve transcription
stability. Lower values reduce latency. Supported values are 7, 8,
10, 12, 14, 16, 20, 24, 32, 36, and 48.
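Since each frame is 80 ms, a delay_in_frames value translates directly into a span of audio context. The sketch below converts between the two and picks the largest supported value that fits a time budget. The helper names are hypothetical, and the selection rule is just one reasonable policy for illustration.

```python
SUPPORTED_DELAYS = (7, 8, 10, 12, 14, 16, 20, 24, 32, 36, 48)
FRAME_MS = 80  # duration of one frame


def delay_ms(frames: int) -> int:
    """Audio context, in milliseconds, for a supported delay_in_frames value."""
    if frames not in SUPPORTED_DELAYS:
        raise ValueError(f"unsupported delay_in_frames: {frames}")
    return frames * FRAME_MS


def largest_delay_within(budget_ms: int) -> int:
    """Largest supported delay_in_frames whose context fits in budget_ms."""
    candidates = [f for f in SUPPORTED_DELAYS if f * FRAME_MS <= budget_ms]
    if not candidates:
        raise ValueError(f"no supported delay fits in {budget_ms} ms")
    return max(candidates)


print(delay_ms(7))                 # 560
print(largest_delay_within(1000))  # 12 (960 ms)
```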
At a turn boundary, pair semantic VAD with send_flush(): when your
app decides the speaker is done, flush asks Gradium to process audio
already submitted and then returns flushed with the matching
flush_id. Internally, the worker tracks the adjusted delay after
flush events so segment timestamps remain aligned with the audio that
was processed.
VAD step messages are emitted on the WebSocket transport.
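The end-of-turn rule described under Detecting end of turn (2 s horizon, inactivity probability above 0.5) can be sketched as a predicate over the predictions in one step message. The payload shape is an assumption: a list of dicts with "horizon_s" and "inactivity_prob" keys. Both field names and both function names are hypothetical, so check the WebSocket reference for the actual schema.

```python
from typing import Mapping, Optional, Sequence


def inactivity_at(
    predictions: Sequence[Mapping[str, float]],
    horizon_s: float,
    tol: float = 1e-6,
) -> Optional[float]:
    """Return the inactivity probability for the given horizon, if present."""
    for p in predictions:
        if abs(p["horizon_s"] - horizon_s) <= tol:
            return p["inactivity_prob"]
    return None


def is_end_of_turn(
    predictions: Sequence[Mapping[str, float]],
    horizon_s: float = 2.0,
    threshold: float = 0.5,
) -> bool:
    """True when the chosen horizon's inactivity probability exceeds threshold."""
    prob = inactivity_at(predictions, horizon_s)
    return prob is not None and prob > threshold
```

When the predicate fires, the application would typically call send_flush() to collect the remaining text for the turn. Tune the horizon and threshold to trade off interruption risk against response latency.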
Flushing
You can force the server to immediately process all buffered audio using send_flush(). This is useful when you need transcription results without waiting for the normal processing cadence, for example at a natural pause in conversation or before switching speakers.
The server will process all outstanding audio, return any pending text results, and then respond with a flushed message containing the same flush_id you provided.
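Because the flushed message echoes the flush_id, a client can keep a small table of outstanding flushes and know when a given one has completed. The sketch below shows that bookkeeping only. The class name FlushTracker is hypothetical, flush ids are assumed to be client-chosen integers, and messages are assumed to be dicts with "type" and "flush_id" keys.

```python
import itertools


class FlushTracker:
    """Track flush ids that have been sent but not yet confirmed.

    Hypothetical helper: it does not talk to the server itself.
    """

    def __init__(self) -> None:
        self._ids = itertools.count(1)
        self._pending: set = set()

    def next_id(self) -> int:
        """Allocate a new flush id and mark it as awaiting confirmation."""
        flush_id = next(self._ids)
        self._pending.add(flush_id)
        return flush_id

    def on_message(self, msg: dict) -> bool:
        """Return True if msg is a flushed message for a pending flush id."""
        if msg.get("type") != "flushed":
            return False
        flush_id = msg.get("flush_id")
        if flush_id in self._pending:
            self._pending.discard(flush_id)
            return True
        return False

    @property
    def pending(self) -> frozenset:
        return frozenset(self._pending)
```

A typical loop would allocate an id with next_id(), hand it to send_flush() in whatever form the SDK expects, and treat the turn's transcript as complete once on_message() returns True for that id.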
Tunable options
Models accept additional configuration via json_config, including
language, temp, padding_bonus, and delay_in_frames. Use
delay_in_frames as the adaptive delay control for your latency versus
quality tradeoff. See
Transcription Settings for the full
table and how to pass json_config across the SDK and REST.
Direct WebSocket
If you don’t want the gradium SDK, for example because you’re calling from a non-Python runtime or you want full control over the wire, you can talk to the WebSocket protocol directly. The message shapes are identical to what the SDK sends. For interactive poking from a terminal, wscat
is the quickest way to confirm reachability and step through messages
by hand.
A scripted client can do the same with a WebSocket library (for example, Python’s websockets).
Next steps
STT WebSocket reference
Complete wire-level schema: every message type, every field, every
error code.
One-shot REST
Transcribe a complete audio file with a single HTTP request.
Transcription settings
language, temp, padding_bonus, delay_in_frames, and how to
pass json_config.
Errors
Error contracts across REST, WebSocket, and streamed responses.