stt_model_name and tts_model_name, and you choose the output
voice with voice_id (required, and it must be a voice in the target
language).
When to use Speech-to-Speech
| If you want to… | Use |
|---|---|
| Translate speech in one language into spoken audio in another, in real time | S2S with json_config={"target_language": "..."} |
| Get only the transcript (no synthesized audio back) | Speech-to-Text |
| Get only synthesized audio from text you already have | Text-to-Speech |
Choosing the SDK API
The Python SDK exposes three ways to drive S2S over the WebSocket. All three talk to wss://api.gradium.ai/api/speech/s2s; choose based on
how your audio source is shaped and whether you need results
incrementally.
| Use case | Method |
|---|---|
| You’re producing audio live (microphone, telephony, agent loop) and want output audio and text as they arrive | s2s_realtime |
| You can iterate the audio end-to-end (file, in-memory buffer, finite generator) and want to stream the output back | s2s_stream |
| You have a complete audio file and just want the finished output audio plus transcript | s2s (buffered) |
s2s_realtime is an async context manager: push chunks with
send_audio() and iterate the same object to receive audio, text,
and end_of_stream messages. s2s_stream consumes an audio iterable
and returns an S2SStream whose iter_audio() yields output audio
bytes (or iter_events() for every message). s2s buffers everything
and returns an S2SResult with the full output audio and transcript.
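The s2s_realtime pattern described above (send chunks while iterating the same object for results) can be sketched as follows. This is an illustration of the control flow only: FakeS2SSession is a hypothetical stand-in written for this example, not the SDK class, and the message shape (dicts with a "type" key) and the finish() method are assumptions. Check the SDK reference for the real constructor arguments and message types.

```python
import asyncio


class FakeS2SSession:
    """Hypothetical stand-in for the object s2s_realtime yields (NOT the SDK)."""

    def __init__(self):
        self._out = asyncio.Queue()

    async def __aenter__(self):
        return self

    async def __aexit__(self, exc_type, exc, tb):
        return False

    async def send_audio(self, chunk: bytes) -> None:
        # The fake echoes each chunk back as an "audio" and a "text" message.
        await self._out.put({"type": "audio", "audio": chunk})
        await self._out.put({"type": "text", "text": f"{len(chunk)} bytes"})

    async def finish(self) -> None:
        # Assumption: some call marks the end of input; the name is made up.
        await self._out.put({"type": "end_of_stream"})

    def __aiter__(self):
        return self

    async def __anext__(self):
        return await self._out.get()


async def run(chunks):
    audio_out, texts = bytearray(), []
    async with FakeS2SSession() as session:

        async def producer():
            # Push input audio concurrently with reading the results.
            for chunk in chunks:
                await session.send_audio(chunk)
            await session.finish()

        send_task = asyncio.create_task(producer())
        async for msg in session:
            if msg["type"] == "audio":
                audio_out.extend(msg["audio"])
            elif msg["type"] == "text":
                texts.append(msg["text"])
            elif msg["type"] == "end_of_stream":
                break
        await send_task
    return bytes(audio_out), texts
```

Running the sender as its own task keeps the receive loop responsive, which is the point of the realtime method: output can arrive before all input has been sent.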
What an S2S request shares
- Models: one S2S model_name, plus optional stt_model_name and tts_model_name for the inner stages.
- Input formats: PCM (multiple sample rates), WAV, Opus, mu-law, A-law. For pcm the input is 24 kHz, 16-bit signed mono.
- Output formats: PCM, WAV, Opus, mu-law, A-law. For pcm the output is 48 kHz, 16-bit signed mono.
- Voice: a required voice_id for the synthesized output. It must be a voice in the same language as target_language.
- Translation target: target_language via json_config. See Speech-to-Speech Settings.
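As a rough illustration of the pcm input format listed above (24 kHz, 16-bit signed mono), the sketch below splits a raw PCM buffer into fixed-duration chunks suitable for feeding a streaming call. The helper names and the 80 ms chunk duration are assumptions chosen for this example; they are not values prescribed by the API.

```python
from typing import Iterator

SAMPLE_RATE_HZ = 24_000  # pcm input rate stated in this section
BYTES_PER_SAMPLE = 2     # 16-bit signed samples
CHANNELS = 1             # mono


def chunk_size_bytes(chunk_ms: int) -> int:
    """Number of bytes covering chunk_ms milliseconds of pcm input."""
    return SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * CHANNELS * chunk_ms // 1000


def iter_pcm_chunks(pcm: bytes, chunk_ms: int = 80) -> Iterator[bytes]:
    """Yield consecutive chunks; the last one may be shorter."""
    size = chunk_size_bytes(chunk_ms)
    for start in range(0, len(pcm), size):
        yield pcm[start:start + size]


# One second of silence: 24,000 samples * 2 bytes = 48,000 bytes.
one_second = bytes(48_000)
chunks = list(iter_pcm_chunks(one_second))
print(chunk_size_bytes(80), len(chunks), len(chunks[-1]))  # 3840 13 1920
```

The same arithmetic applies to the 48 kHz pcm output: one second of output audio is 96,000 bytes.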
Next steps
Use the WebSocket API
Stream audio in, output audio and transcript out, real-time or
buffered.
S2S settings
voice_id, target_language, inner model selection, formats.
S2S WebSocket reference
Complete wire-level schema: every message type, every field, every
error code.
Voices
Pick the voice used for the synthesized output.