target_language, and re-synthesized with the voice you select. This
guide covers streaming via the Python SDK.
Not sure which API to use? The overview compares s2s_realtime,
s2s_stream, and the buffered s2s helper.
Quickstart
The shortest path: open an s2s_realtime context manager, push audio
chunks, and iterate to receive output audio and text.
pcm is 48 kHz, 16-bit signed mono. Other formats are
supported; see Setup Parameters.
Choosing the SDK API
The Python SDK exposes three ways to drive S2S over the WebSocket endpoint. All talk to wss://api.gradium.ai/api/speech/s2s; choose
based on how your audio source is shaped.
| Use case | Method |
|---|---|
| You’re producing audio live (microphone, telephony, agent loop) and want output as it arrives | s2s_realtime |
| You can iterate the audio end-to-end (file, in-memory buffer, finite generator) and want to stream output back | s2s_stream |
| You have a complete file and just want the finished output audio plus transcript | s2s (buffered) |
s2s_realtime and a dict for
s2s_stream / s2s. The wire-level setup message they produce is the
same.
Real-time push: s2s_realtime
Use s2s_realtime when audio is being produced live (microphone,
telephony, conversational agent) and you need output audio and text as
they arrive. It’s an async context manager: push audio with
send_audio() and iterate the same object to receive every message
type.
send_audio() accepts bytes or a 1-D numpy array (int16 or
float32) when input_format is a pcm variant. In the iterator,
audio messages already carry decoded bytes in msg["audio"].
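The push-and-iterate pattern described above can be sketched with a sender task running alongside the receive loop. The example below does not call the gradium SDK: FakeS2SSession is a hypothetical stand-in that mimics only the surface named in this section (send_audio(), send_eos(), async iteration over message dicts), and the "type" key on each message is an assumption. Check the real session object against the SDK before reusing the shape.

```python
import asyncio


class FakeS2SSession:
    """Hypothetical stand-in for an s2s_realtime session; NOT the gradium SDK.

    It echoes input audio back as "audio" messages so the pattern can run
    offline. The "type" key on messages is an assumption.
    """

    async def __aenter__(self):
        # Create the queue inside the running event loop.
        self._queue = asyncio.Queue()
        return self

    async def __aexit__(self, exc_type, exc, tb):
        return False

    async def send_audio(self, chunk: bytes) -> None:
        await self._queue.put({"type": "audio", "audio": chunk})

    async def send_eos(self) -> None:
        await self._queue.put({"type": "end_of_stream"})

    def __aiter__(self):
        return self

    async def __anext__(self):
        return await self._queue.get()


async def push_and_collect(session, chunks) -> bytes:
    """Push chunks from a background task while this coroutine receives."""

    async def sender():
        for chunk in chunks:
            await session.send_audio(chunk)
            await asyncio.sleep(0)  # yield so the receive loop can run
        await session.send_eos()

    received = bytearray()
    async with session:
        send_task = asyncio.create_task(sender())
        async for msg in session:
            if msg["type"] == "audio":
                received.extend(msg["audio"])
            elif msg["type"] == "end_of_stream":
                break  # stop explicitly rather than rely on the iterator ending
        await send_task
    return bytes(received)
```

Running the sender as its own task keeps input flowing while output is consumed, which is the point of the real-time variant: neither direction waits for the other to finish.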
Pull-based: s2s_stream
Use s2s_stream when the audio is already in hand (a file or a
complete in-memory buffer) but you still want to stream the output back
as it’s produced. The function consumes your audio iterable and returns
an S2SStream.
iter_audio() yields output audio bytes and collects text segments
along the way. To handle every server message yourself (both audio
and text as dicts, with audio fields decoded), use iter_events()
instead.
Buffered: s2s
Use s2s when you don’t need results incrementally: hand it a complete
audio buffer and get back an S2SResult with the full output audio and
transcript.
audio can be raw bytes, a 1-D numpy array (int16/float32, pcm
input only, sample_rate=24000 required), or an async generator of
byte chunks. The returned S2SResult exposes raw_data,
sample_rate, output_format, request_id, text, and
text_with_timestamps; pcm16() / pcm() convert PCM output to numpy
arrays.
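Raw PCM has no header, so most players will not open it directly. As a minimal sketch, the helper below wraps 16-bit mono PCM bytes in a WAV container using only the standard library. It assumes you pass it the raw_data and sample_rate of a result whose output_format is a pcm variant; the function name pcm16_to_wav_bytes is illustrative and not part of the SDK.

```python
import io
import wave


def pcm16_to_wav_bytes(raw: bytes, sample_rate: int) -> bytes:
    """Wrap raw 16-bit signed little-endian mono PCM in a WAV container."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)        # mono
        wav.setsampwidth(2)        # 16-bit samples
        wav.setframerate(sample_rate)
        wav.writeframes(raw)
    return buf.getvalue()
```

With a buffered result in hand, usage would look like `open("out.wav", "wb").write(pcm16_to_wav_bytes(result.raw_data, result.sample_rate))`, assuming those attributes hold what this section describes.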
Setup Parameters
- model_name: The S2S model to use (default: "default").
- stt_model_name: Speech-to-text model used to transcribe the input. Optional.
- tts_model_name: Text-to-speech model used to synthesize the output. Optional.
- voice_id (required): Voice UID used for the synthesized output. It must be a voice in the same language as target_language. See Voices.
- input_format: Audio format of the input. Supported values:
  - "pcm": raw PCM, 24 kHz, 16-bit signed little-endian, mono. Send 1920-sample (80 ms) chunks for best results.
  - "pcm_8000", "pcm_16000", "pcm_22050", "pcm_24000", "pcm_44100", "pcm_48000": raw PCM at the indicated sample rate (16-bit signed little-endian, mono).
  - "wav": a valid WAV file with PCM data.
  - "opus": Ogg-wrapped Opus stream.
  - "ulaw_8000" (alias "mulaw_8000"): mu-law encoded PCM at 8 kHz.
  - "alaw_8000": A-law encoded PCM at 8 kHz.
- output_format: Audio format of the output. Same value set as input_format. For "pcm" the output is 48 kHz, 16-bit signed little-endian, mono.
- json_config: Set target_language (e.g. "en") to the language to translate the speech into. The voice_id must be a voice in this language. See Speech-to-Speech Settings.
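The chunk size recommended for "pcm" input works out as follows: 1920 samples at 24 kHz is 1920 / 24000 = 0.08 s, and at 2 bytes per sample each chunk is 3840 bytes. The sketch below slices an in-memory PCM buffer into chunks of that size; the helper name iter_pcm_chunks is illustrative and not part of the SDK.

```python
from typing import Iterator

SAMPLE_RATE_HZ = 24_000      # "pcm" input rate from the parameter list
SAMPLES_PER_CHUNK = 1920     # recommended chunk size (80 ms)
BYTES_PER_SAMPLE = 2         # 16-bit signed samples
CHUNK_BYTES = SAMPLES_PER_CHUNK * BYTES_PER_SAMPLE  # 3840 bytes


def iter_pcm_chunks(pcm: bytes, chunk_bytes: int = CHUNK_BYTES) -> Iterator[bytes]:
    """Yield consecutive chunks of raw PCM; the last one may be shorter."""
    for start in range(0, len(pcm), chunk_bytes):
        yield pcm[start:start + chunk_bytes]
```

A generator like this is one way to turn a complete buffer into the finite iterable that the pull-based API is described as consuming.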
Message Types
When iterating over an s2s_realtime session you receive every message
the server sends:
- ready: connection accepted; carries sample_rate and frame_size of the output. Available via s2s.ready when you pass wait_for_ready_on_start=True.
- text: a translated transcript segment (in target_language) with start_s / stop_s timestamps.
- audio: an output audio chunk. In the SDK iterator the audio field is already decoded bytes.
- end_of_stream: terminal message after send_eos().
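One way to handle these message types is a small dispatch loop that accumulates audio and transcript segments. This is a sketch over plain dicts, not SDK code: the "type" and "text" key names are assumptions, while audio, start_s, and stop_s follow the descriptions above.

```python
def collect_output(messages):
    """Gather output audio and timed transcript segments from message dicts.

    ASSUMPTION: each message has a "type" key, and text messages carry
    their transcript under "text".
    """
    audio = bytearray()
    segments = []
    for msg in messages:
        kind = msg.get("type")
        if kind == "audio":
            audio.extend(msg["audio"])  # already decoded bytes in the SDK iterator
        elif kind == "text":
            segments.append((msg["start_s"], msg["stop_s"], msg["text"]))
        elif kind == "end_of_stream":
            break
        # "ready" and any unknown types are ignored here
    return bytes(audio), segments
```

Ignoring unknown types instead of raising keeps the loop tolerant of message kinds it does not know about.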
Direct WebSocket
If you don’t want the gradium SDK (for example, you’re calling from a non-Python runtime, or you want full control over the wire), you can talk to the WebSocket protocol directly. The message shapes are identical to what the SDK sends. For interactive poking from a terminal, wscat
is the quickest way to confirm reachability and step through messages
by hand. From Python, the same exchange can be scripted with a
WebSocket client library such as websockets.
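Whatever client you use, the work on the wire is serializing a setup message, sending audio, and decoding what comes back. The helpers below sketch that step with the standard library only. The field layout is an assumption, not the documented schema: JSON text frames, a "type" discriminator, a "setup" message carrying the Setup Parameters, and base64-encoded audio payloads. Confirm each of these against the S2S WebSocket reference before relying on them.

```python
import base64
import json


def setup_message(voice_id: str, target_language: str,
                  input_format: str = "pcm", output_format: str = "pcm") -> str:
    """Build a setup frame. ASSUMPTION: field layout is illustrative only."""
    return json.dumps({
        "type": "setup",
        "model_name": "default",
        "voice_id": voice_id,
        "input_format": input_format,
        "output_format": output_format,
        "json_config": {"target_language": target_language},
    })


def audio_message(chunk: bytes) -> str:
    """Build an audio frame. ASSUMPTION: audio travels as base64 text."""
    return json.dumps({
        "type": "audio",
        "audio": base64.b64encode(chunk).decode("ascii"),
    })


def decode_server_message(raw: str) -> dict:
    """Parse a server frame and decode its audio payload, if any."""
    msg = json.loads(raw)
    if msg.get("type") == "audio":
        msg["audio"] = base64.b64decode(msg["audio"])
    return msg
```

Keeping serialization separate from the socket code makes it easy to unit-test the message shapes without a network connection.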
Next steps
S2S WebSocket reference
Complete wire-level schema: every message type, every field, every
error code.
S2S settings
voice_id, target_language, inner model selection, formats.
Voices
Pick the voice used for the synthesized output.
Errors
Error contracts across REST, WebSocket, and streamed responses.