Speech-to-Speech
S2S WebSocket Stream
Stream audio in and audio out over a single Gradium speech-to-speech WebSocket: transcribe, optionally translate, and re-synthesize in real time.
Lifecycle
The client sends setup, streams audio messages, and finishes with
end_of_stream. The server sends ready, then text and audio messages as
available, and finally end_of_stream. The protocol combines the STT
input side (audio in) with the TTS output side (text and audio
out). See WebSocket Lifecycle for
connection behavior, reusable sockets, browser tokens, and errors.
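The server-side ordering described above can be checked with a small validator. This is only a sketch: the function name `check_server_order` is hypothetical, and allowing a lone error before ready (for example, after a rejected setup) is an assumption rather than something this page states. With multiplexed requests, the check would apply per client_req_id.

```python
from typing import Iterable

# Server message types named in this reference.
STREAMING_TYPES = {"text", "audio"}
TERMINAL_TYPES = {"end_of_stream", "error"}


def check_server_order(types: Iterable[str]) -> bool:
    """Return True if the sequence is ready, then text/audio, then one terminal message."""
    seq = list(types)
    if not seq or seq[-1] not in TERMINAL_TYPES:
        return False
    body = seq[:-1]
    if not body:
        # Assumption: only an error may arrive before ready.
        return seq[-1] == "error"
    return body[0] == "ready" and all(t in STREAMING_TYPES for t in body[1:])
```

A client could feed the `type` field of each received message into such a check during testing to catch unexpected orderings early.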
Client Messages
setup
| Field | Type | Required | Description |
|---|---|---|---|
| type | string | Yes | Always "setup". |
| model_name | string | No | Model alias. Defaults to "default". |
| stt_model_name | string | No | Speech-to-text model used to transcribe the input. |
| tts_model_name | string | No | Text-to-speech model used to synthesize the output. |
| input_format | string | No | pcm, wav, opus, ulaw_8000, alaw_8000, or explicit PCM rates such as pcm_16000. Defaults to wav. |
| output_format | string | No | wav, pcm, opus, ulaw_8000, etc. Defaults to wav. |
| voice_id | string | No | Voice used for the synthesized output. See Voices. |
| json_config | object or string | No | Advanced settings. Set target_language to translate the speech; omit it to keep the original language. |
| client_req_id | string | No | Correlates multiplexed requests. |
| close_ws_on_eos | boolean | No | Defaults to true; set false to keep the socket open. |
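As an illustration of the setup fields above, the following sketch serializes a setup message with the standard library. The helper name `build_setup`, the placeholder voice ID, and the language value "fr" are assumptions for illustration; this page does not specify the accepted format of target_language.

```python
import json
from typing import Any, Dict, Optional


def build_setup(
    voice_id: Optional[str] = None,
    target_language: Optional[str] = None,
    input_format: str = "wav",
    output_format: str = "wav",
    close_ws_on_eos: bool = True,
    client_req_id: Optional[str] = None,
) -> str:
    """Serialize a setup message, omitting optional fields that were not given."""
    msg: Dict[str, Any] = {
        "type": "setup",
        "input_format": input_format,
        "output_format": output_format,
        "close_ws_on_eos": close_ws_on_eos,
    }
    if voice_id is not None:
        msg["voice_id"] = voice_id
    if target_language is not None:
        # Setting target_language requests translation; omitting it keeps
        # the original language. The value format is an assumption.
        msg["json_config"] = {"target_language": target_language}
    if client_req_id is not None:
        msg["client_req_id"] = client_req_id
    return json.dumps(msg)


# Hypothetical values: translate 16 kHz PCM input and synthesize with a chosen voice.
setup_frame = build_setup(
    voice_id="YOUR_VOICE_ID", target_language="fr", input_format="pcm_16000"
)
```

Leaving json_config out entirely when no translation is wanted mirrors the table's note that omitting target_language keeps the original language.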
audio
| Field | Type | Required | Description |
|---|---|---|---|
| type | string | Yes | Always "audio". |
| audio | string | Yes | Base64-encoded input audio chunk. |
| client_req_id | string | No | Required when routing a multiplexed request. |
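A minimal sketch of how raw input bytes might be split into audio messages follows. The helper name `audio_messages` and the default chunk size are assumptions; this page does not state a required or recommended chunk size.

```python
import base64
import json
from typing import Iterator, Optional


def audio_messages(
    data: bytes, chunk_size: int = 3840, client_req_id: Optional[str] = None
) -> Iterator[str]:
    """Yield JSON audio messages, each carrying one base64-encoded chunk."""
    for offset in range(0, len(data), chunk_size):
        chunk = data[offset:offset + chunk_size]
        msg = {"type": "audio", "audio": base64.b64encode(chunk).decode("ascii")}
        if client_req_id is not None:
            # Only needed when routing a multiplexed request.
            msg["client_req_id"] = client_req_id
        yield json.dumps(msg)
```

Each yielded string would be sent as one WebSocket text frame. The last chunk may be shorter than `chunk_size`.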
end_of_stream
| Field | Type | Required | Description |
|---|---|---|---|
| type | string | Yes | Always "end_of_stream". |
| client_req_id | string | No | Ends the matching multiplexed request. |
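To show how client_req_id ties the three client messages together, here is a sketch that builds the full frame sequence for one multiplexed request. The helper names `tag` and `request_frames` and the ID "req-1" are hypothetical, and setting close_ws_on_eos to false for multiplexed use is an assumption based on its description as keeping the socket open.

```python
import json
from typing import Dict, List


def tag(message: Dict[str, object], client_req_id: str) -> str:
    """Attach client_req_id to a client message and serialize it."""
    return json.dumps({**message, "client_req_id": client_req_id})


def request_frames(client_req_id: str, audio_chunks_b64: List[str]) -> List[str]:
    """Build setup, audio, and end_of_stream frames for one request."""
    # Assumption: keep the socket open so other requests can share it.
    frames = [tag({"type": "setup", "close_ws_on_eos": False}, client_req_id)]
    frames += [tag({"type": "audio", "audio": c}, client_req_id) for c in audio_chunks_b64]
    # end_of_stream ends only the request with the matching client_req_id.
    frames.append(tag({"type": "end_of_stream"}, client_req_id))
    return frames
```

Frames from several such requests could then be interleaved on one socket, with each request closed by its own end_of_stream.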
Server Messages
ready
| Field | Type | Description |
|---|---|---|
| type | string | Always "ready". |
| request_id | string | Gradium request ID for logging and support. |
| model_name | string | Requested model alias. |
| sample_rate | integer | Output sample rate in Hz. |
| frame_size | integer | Output frame size in samples. |
| client_req_id | string | Present for multiplexed requests. |
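Since frame_size is given in samples and sample_rate in Hz, the duration of one output frame is frame_size / sample_rate seconds. The sketch below applies that arithmetic to a ready message; the function name `frame_duration_s` is hypothetical, and the numbers in the example are made up rather than values the service is known to return.

```python
import json


def frame_duration_s(ready_json: str) -> float:
    """Return the duration of one output frame in seconds."""
    msg = json.loads(ready_json)
    if msg.get("type") != "ready":
        raise ValueError("expected a ready message")
    # samples per frame divided by samples per second
    return msg["frame_size"] / msg["sample_rate"]


# Hypothetical ready payload, for illustration only.
example_ready = json.dumps(
    {"type": "ready", "request_id": "example", "sample_rate": 24000, "frame_size": 1920}
)
duration = frame_duration_s(example_ready)
```

A player could use this value to size its buffers or schedule playback of decoded frames.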
text
| Field | Type | Description |
|---|---|---|
| type | string | Always "text". |
| text | string | Transcribed (and translated, if target_language is set) text segment. |
| start_s | number | Segment start time in seconds. |
| stop_s | number | Segment stop time in seconds. |
| stream_id | integer | Stream identifier, when present. |
| client_req_id | string | Present for multiplexed requests. |
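One way to turn text messages into a running transcript is to order the segments by start_s and join them. This is a sketch under assumptions: the helper name `join_transcript` is hypothetical, and joining with single spaces is a guess, since this page does not say whether segments carry their own spacing.

```python
from typing import Dict, Iterable


def join_transcript(messages: Iterable[Dict[str, object]]) -> str:
    """Join text segments in start-time order, ignoring other message types."""
    segments = [m for m in messages if m.get("type") == "text"]
    segments.sort(key=lambda m: m["start_s"])
    # Assumption: segments are joined with a single space.
    return " ".join(str(m["text"]).strip() for m in segments)
```

Sorting by start_s keeps the transcript stable even if the messages were collected out of order.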
audio
| Field | Type | Description |
|---|---|---|
| type | string | Always "audio". |
| audio | string | Base64-encoded output audio chunk. |
| start_s | number | Chunk start time in seconds. |
| stop_s | number | Chunk stop time in seconds. |
| stream_id | integer | Stream identifier, when present. |
| client_req_id | string | Present for multiplexed requests. |
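The sketch below decodes output audio chunks and groups them by client_req_id, which could be useful when several requests share one socket. The helper name `collect_audio` is hypothetical. Simple concatenation is only clearly meaningful for headerless output such as pcm; how chunks are framed for wav or opus output is not stated on this page.

```python
import base64
from collections import defaultdict
from typing import Dict, Iterable, Optional


def collect_audio(messages: Iterable[Dict[str, object]]) -> Dict[Optional[str], bytes]:
    """Decode audio chunks and concatenate them per client_req_id, in arrival order."""
    buffers: Dict[Optional[str], bytearray] = defaultdict(bytearray)
    for m in messages:
        if m.get("type") == "audio":
            # Non-multiplexed messages have no client_req_id and land under None.
            buffers[m.get("client_req_id")] += base64.b64decode(str(m["audio"]))
    return {key: bytes(buf) for key, buf in buffers.items()}
```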
Terminal messages
| Type | Description |
|---|---|
| end_of_stream | The request is complete. |
| error | Terminal error message; the socket closes after the error. |
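A receive loop needs to know when to stop. The following sketch treats end_of_stream as normal completion and raises on error. The names `StreamError` and `is_finished` are hypothetical, and because this page does not list the fields of an error message, the whole payload is surfaced rather than a specific field.

```python
import json


class StreamError(Exception):
    """Raised when the server sends a terminal error message."""


def is_finished(raw: str) -> bool:
    """Return True on end_of_stream, raise on error, False for any other message."""
    msg = json.loads(raw)
    kind = msg.get("type")
    if kind == "error":
        # Error fields are not documented here, so keep the full payload.
        raise StreamError(json.dumps(msg))
    return kind == "end_of_stream"
```

In a receive loop, a client would call `is_finished` on each incoming frame and break out once it returns True.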
Headers
Your Gradium API key
Response
| Status | Description |
|---|---|
| 101 | WebSocket connection established. |