Speech-to-Speech Overview

Gradium Speech-to-Speech (S2S) translates spoken audio from one language into spoken audio in another, over a single duplex WebSocket. The input audio is transcribed, translated into the target language, and re-synthesized as output audio. You stream audio in and receive both the synthesized output audio and the translated transcript back as they’re produced.

Under the hood, S2S chains the same models you use directly: a Speech-to-Text model on the input side, a translation step, and a Text-to-Speech model on the output side. You can pin the inner models with stt_model_name and tts_model_name, and you choose the output voice with voice_id (required, and it must be a voice in the target language).
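As a rough illustration, the parameters named above can be gathered into one settings dict before opening the connection. This is a minimal sketch, not SDK code: build_s2s_settings is a hypothetical helper, the flat dict layout is an assumption, and the only field names taken from this page are voice_id, stt_model_name, tts_model_name, model_name, and json_config with its target_language key. The "fr" value is a placeholder; the accepted language format is documented in Speech-to-Speech Settings.

```python
from typing import Any, Optional


def build_s2s_settings(
    voice_id: str,
    target_language: str,
    model_name: Optional[str] = None,
    stt_model_name: Optional[str] = None,
    tts_model_name: Optional[str] = None,
) -> dict[str, Any]:
    """Collect S2S parameters into one dict (hypothetical helper, not SDK API)."""
    if not voice_id:
        raise ValueError("voice_id is required and must be a voice in the target language")
    if not target_language:
        raise ValueError("target_language is required to select the translation target")

    settings: dict[str, Any] = {
        "voice_id": voice_id,
        # The translation target travels inside json_config.
        "json_config": {"target_language": target_language},
    }
    # Optional model pins are included only when the caller sets them.
    optional = {
        "model_name": model_name,
        "stt_model_name": stt_model_name,
        "tts_model_name": tts_model_name,
    }
    for key, value in optional.items():
        if value is not None:
            settings[key] = value
    return settings


# Placeholder values; substitute a real voice and model names from your account.
settings = build_s2s_settings(
    voice_id="YOUR_VOICE_ID",
    target_language="fr",
    tts_model_name="YOUR_TTS_MODEL",
)
```

Validating voice_id up front mirrors the requirement stated above, so a missing voice fails locally instead of after the socket is open.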

When to use Speech-to-Speech

If you want to…	Use
Translate speech in one language into spoken audio in another, in real time	S2S with `json_config={"target_language": "..."}`
Get only the transcript (no synthesized audio back)	Speech-to-Text
Get only synthesized audio from text you already have	Text-to-Speech

Because S2S is a single duplex connection, you don’t have to wire STT and TTS together yourself or manage two sockets. The server runs the pipeline and streams results back as soon as they’re ready.

Choosing the SDK API

The Python SDK exposes three ways to drive S2S over the WebSocket. All three talk to wss://api.gradium.ai/api/speech/s2s; choose based on how your audio source is shaped and whether you need results incrementally.

Use case	Method
You’re producing audio live (microphone, telephony, agent loop) and want output audio and text as they arrive	`s2s_realtime`
You can iterate the audio end-to-end (file, in-memory buffer, finite generator) and want to stream the output back	`s2s_stream`
You have a complete audio file and just want the finished output audio plus transcript	`s2s` (buffered)

s2s_realtime is an async context manager: push chunks with send_audio() and iterate the same object to receive audio, text, and end_of_stream messages. s2s_stream consumes an audio iterable and returns an S2SStream whose iter_audio() yields output audio bytes (or iter_events() for every message). s2s buffers everything and returns an S2SResult with the full output audio and transcript.
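The receive side of the realtime flow can be sketched as a loop that sorts messages by type until end_of_stream arrives. The message shape here is an assumption: this page names the audio, text, and end_of_stream message types, but the dict layout and the "type", "audio", and "text" keys are illustrative. fake_session is a stand-in for iterating a real s2s_realtime session so the sketch runs without a network connection; check the S2S WebSocket reference for the actual schema.

```python
import asyncio
from typing import Any, AsyncIterator


async def drain_session(messages: AsyncIterator[dict[str, Any]]) -> tuple[bytes, str]:
    """Accumulate output audio and transcript text until end_of_stream."""
    audio = bytearray()
    text_parts: list[str] = []
    async for msg in messages:
        kind = msg.get("type")
        if kind == "audio":
            audio.extend(msg["audio"])       # assumed: raw output audio bytes
        elif kind == "text":
            text_parts.append(msg["text"])   # assumed: a piece of the translated transcript
        elif kind == "end_of_stream":
            break                            # the server has finished sending results
    # Text pieces are concatenated as-is; adjust if the server sends whole words.
    return bytes(audio), "".join(text_parts)


async def fake_session() -> AsyncIterator[dict[str, Any]]:
    """Stand-in for a realtime session; yields canned messages in arrival order."""
    yield {"type": "audio", "audio": b"\x00\x01"}
    yield {"type": "text", "text": "Bonjour"}
    yield {"type": "audio", "audio": b"\x02\x03"}
    yield {"type": "end_of_stream"}


audio, transcript = asyncio.run(drain_session(fake_session()))
```

In a live application you would typically play or forward each audio chunk as it arrives instead of buffering it; the buffering here only keeps the example small.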

What an S2S request shares

Models: one S2S model_name, plus optional stt_model_name and tts_model_name for the inner stages.
Input formats: PCM (multiple sample rates), WAV, Opus, mu-law, A-law. For pcm the input is 24 kHz, 16-bit signed mono.
Output formats: PCM, WAV, Opus, mu-law, A-law. For pcm the output is 48 kHz, 16-bit signed mono.
Voice: a required voice_id for the synthesized output. It must be a voice in the same language as target_language.
Translation target: target_language via json_config. See Speech-to-Speech Settings.
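The pcm figures in the list above translate directly into byte counts, which is useful when sizing the chunks you send. The sketch below works through that arithmetic for 16-bit mono audio and splits a buffer into fixed-duration chunks. The 80 ms chunk duration and the helper names are assumptions made for illustration, not values or functions from the SDK.

```python
from typing import Iterator

INPUT_SAMPLE_RATE_HZ = 24_000   # pcm input rate from the list above
OUTPUT_SAMPLE_RATE_HZ = 48_000  # pcm output rate from the list above
BYTES_PER_SAMPLE = 2            # 16-bit signed samples
CHANNELS = 1                    # mono


def pcm_bytes(duration_ms: int, sample_rate_hz: int) -> int:
    """Number of bytes needed to hold duration_ms of 16-bit mono PCM."""
    samples = sample_rate_hz * duration_ms // 1000
    return samples * BYTES_PER_SAMPLE * CHANNELS


def chunk_pcm(
    pcm: bytes,
    chunk_ms: int = 80,  # assumed chunk duration; tune for your latency needs
    sample_rate_hz: int = INPUT_SAMPLE_RATE_HZ,
) -> Iterator[bytes]:
    """Yield consecutive fixed-duration chunks; the last one may be shorter."""
    size = pcm_bytes(chunk_ms, sample_rate_hz)
    for start in range(0, len(pcm), size):
        yield pcm[start:start + size]


# One second of silence at the input rate: 24,000 samples * 2 bytes = 48,000 bytes.
one_second = bytes(pcm_bytes(1000, INPUT_SAMPLE_RATE_HZ))
chunks = list(chunk_pcm(one_second))
```

At these rates, one second of pcm output (96,000 bytes) is twice the size of one second of pcm input (48,000 bytes), so playback buffers need to be sized for the output rate, not the input rate.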

Next steps

Use the WebSocket API

Stream audio in, output audio and transcript out, real-time or buffered.

S2S settings

voice_id, target_language, inner model selection, formats.

S2S WebSocket reference

Complete wire-level schema: every message type, every field, every error code.

Voices

Pick the voice used for the synthesized output.
