Speech-to-Speech Settings

A Speech-to-Speech request translates spoken audio into another language. It is configured through the setup message: some fields select the pipeline (models, voice, formats), and the translation target is passed through `json_config`. In the Python SDK, `json_config` is a dict; the SDK serializes it to a JSON string on the wire for you. These options apply to the WebSocket API across all three SDK entry points (`s2s_realtime`, `s2s_stream`, and the buffered `s2s`). For the STT-only and TTS-only options, see Transcription Settings and Voice Settings.

Setup fields

Field	Type	Required	Effect
`model_name`	string	Yes	S2S model alias. Defaults to `"default"`.
`stt_model_name`	string	No	Speech-to-text model used to transcribe the input.
`tts_model_name`	string	No	Text-to-speech model used to synthesize the output.
`voice_id`	string	Yes	Voice UID for the synthesized output. Must be a voice in the same language as `target_language`. See Voices.
`input_format`	string	Yes	`pcm`, `wav`, `opus`, `ulaw_8000`, `alaw_8000`, or an explicit PCM rate such as `pcm_24000`. For `pcm`, input is 24 kHz, 16-bit signed mono.
`output_format`	string	Yes	Same value set as `input_format`. For `pcm`, output is 48 kHz, 16-bit signed mono.
`json_config`	object or string	No	Advanced pipeline settings; see `json_config` options below.
`client_req_id`	string	No	Correlates multiplexed requests. See Multiplexing.
`close_ws_on_eos`	boolean	No	Defaults to `true`; set `false` to keep the socket open after `end_of_stream`.
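
As an illustration of the table above, here is a minimal sketch of a client-side check for a setup dict before it is sent. The helper names (`is_known_format`, `check_setup`) are hypothetical and not part of the SDK; the required fields follow the Required column, and the pattern for explicit PCM rates (`pcm_` followed by digits) is an assumption based on the `pcm_24000` example.

```python
import re

# Format names listed in the table; explicit PCM rates are assumed to look like "pcm_24000".
NAMED_FORMATS = {"pcm", "wav", "opus", "ulaw_8000", "alaw_8000"}
PCM_RATE = re.compile(r"pcm_\d+")
# Fields marked "Yes" in the Required column.
REQUIRED_FIELDS = ("model_name", "voice_id", "input_format", "output_format")


def is_known_format(value: str) -> bool:
    """True if the value is a listed format name or an explicit PCM rate."""
    return value in NAMED_FORMATS or PCM_RATE.fullmatch(value) is not None


def check_setup(setup: dict) -> list:
    """Return a list of problems found in a setup dict (empty if none)."""
    problems = [
        f"missing required field: {name}"
        for name in REQUIRED_FIELDS
        if name not in setup
    ]
    for name in ("input_format", "output_format"):
        if name in setup and not is_known_format(setup[name]):
            problems.append(f"unrecognized {name}: {setup[name]!r}")
    return problems


setup = {
    "model_name": "default",
    "voice_id": "YTpq7expH9539ERJ",
    "input_format": "pcm_24000",
    "output_format": "wav",
    "close_ws_on_eos": False,  # keep the socket open after end_of_stream
}
print(check_setup(setup))  # prints []
```

A check like this only catches typos early; the server remains the authority on which values are accepted.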

`json_config` options

Option	Type	Allowed values	Effect
`target_language`	string	`"en"`, `"fr"`, `"de"`, `"es"`, `"pt"`	Language to translate the speech into before synthesis.

The transcribed text is translated into `target_language` before the output audio is generated, and the text messages you receive back carry the translated text. The `voice_id` you choose must be a voice in this same language.
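
Because `target_language` accepts only the values listed above, it can be worth rejecting anything else before opening a connection. The following is a small sketch; `make_json_config` is a hypothetical helper, not an SDK function.

```python
# Allowed values copied from the options table.
ALLOWED_TARGET_LANGUAGES = {"en", "fr", "de", "es", "pt"}


def make_json_config(target_language: str) -> dict:
    """Build a json_config dict, rejecting languages not listed in the table."""
    if target_language not in ALLOWED_TARGET_LANGUAGES:
        allowed = ", ".join(sorted(ALLOWED_TARGET_LANGUAGES))
        raise ValueError(
            f"unsupported target_language {target_language!r}; expected one of: {allowed}"
        )
    return {"target_language": target_language}


config = make_json_config("fr")
print(config)  # prints {'target_language': 'fr'}
```

This does not check that `voice_id` belongs to the same language; that still has to be confirmed against the voice list.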

Passing `json_config`

The same `json_config` payload is sent regardless of which SDK entry point you use; only the call shape differs:

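# Assumes `client` is an initialized SDK client and that `audio_generator` and
# `audio_data` are defined or imported elsewhere. Run inside an async function.
config = {"target_language": "en"}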

# Real-time: setup is keyword arguments; json_config stays nested.
async with client.s2s_realtime(
    model_name="default",
    input_format="pcm",
    output_format="pcm",
    voice_id="YTpq7expH9539ERJ",
    json_config=config,
) as s2s:
    ...

# Pull-based / buffered: setup is a dict containing json_config.
setup = {
    "model_name": "default",
    "input_format": "pcm",
    "output_format": "wav",
    "voice_id": "YTpq7expH9539ERJ",
    "json_config": config,
}
stream = await client.s2s_stream(setup, audio_generator(audio_data))
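
If you are not using the Python SDK, you would need to do the `json_config` serialization yourself. The sketch below shows one way to turn the nested dict into a JSON string inside the setup payload. `serialize_setup` is a hypothetical helper, and the exact envelope of the setup message is an assumption here; check the S2S WebSocket reference for the real wire schema.

```python
import json


def serialize_setup(setup: dict) -> str:
    """Return the setup as JSON text, with json_config encoded as a JSON string."""
    wire = dict(setup)  # shallow copy so the caller's dict is left untouched
    cfg = wire.get("json_config")
    if isinstance(cfg, dict):
        # Assumed wire shape: json_config travels as a string, not a nested object.
        wire["json_config"] = json.dumps(cfg)
    return json.dumps(wire)


setup = {
    "model_name": "default",
    "input_format": "pcm",
    "output_format": "wav",
    "voice_id": "YTpq7expH9539ERJ",
    "json_config": {"target_language": "en"},
}
payload = serialize_setup(setup)
decoded = json.loads(payload)
print(type(decoded["json_config"]).__name__)  # prints str
```

A `json_config` that is already a string passes through unchanged, which matches the "object or string" type in the setup table.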

Next steps

S2S WebSocket guide: real-time, pull-based, and buffered S2S with the Python SDK.

S2S WebSocket reference: the complete wire-level schema, covering every message type, every field, and every error code.

Voices: pick the voice used for the synthesized output.

Voice settings: TTS-side options for the synthesis stage.