Speech-to-Speech (WebSocket)

Gradium Speech-to-Speech (S2S) streams input audio to the server over a single duplex WebSocket and streams the translated output audio plus its transcript back. The audio is transcribed, translated into target_language, and re-synthesized with the voice you select. This guide covers streaming via the Python SDK.

Not sure which API to use?

The overview compares s2s_realtime, s2s_stream, and the buffered s2s helper.

Quickstart

The shortest path: open an s2s_realtime context manager, push audio chunks, and iterate to receive output audio and text.

import asyncio
import gradium

async def main():
    client = gradium.client.GradiumClient(api_key="your-api-key")

    async with client.s2s_realtime(
        model_name="default",
        input_format="pcm",
        output_format="pcm",
        voice_id="YTpq7expH9539ERJ",  # required; must be a voice in the target language
        json_config={"target_language": "en"},
        wait_for_ready_on_start=True,
    ) as s2s:
        # Producer: push 80 ms PCM frames as they're captured.
        # microphone_chunks() is a placeholder for your own capture source.
        async def producer():
            async for chunk in microphone_chunks():  # 1920-sample frames
                await s2s.send_audio(chunk)
            await s2s.send_eos()

        # Consumer: collect output audio, print the transcript as it arrives.
        out = []
        async def consumer():
            async for msg in s2s:
                if msg["type"] == "audio":
                    out.append(msg["audio"])  # already-decoded bytes
                elif msg["type"] == "text":
                    print(msg["text"], end=" ", flush=True)
                elif msg["type"] == "end_of_stream":
                    return

        await asyncio.gather(producer(), consumer())

asyncio.run(main())

Expected input: 24 kHz, 16-bit signed mono PCM in 1920-sample (80 ms) chunks. Output pcm is 48 kHz, 16-bit signed mono. Other formats are supported; see Setup Parameters.
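
The examples in this guide call microphone_chunks() without defining it. As a minimal sketch for local testing, the stand-in below slices a raw PCM buffer into 80 ms frames and can pace them like live capture. The name fake_microphone_chunks and its parameters are hypothetical and not part of the gradium SDK.

```python
import asyncio

SAMPLE_RATE = 24_000      # Hz, per the "pcm" input format
FRAME_SAMPLES = 1920      # 80 ms at 24 kHz
BYTES_PER_SAMPLE = 2      # 16-bit signed mono
FRAME_BYTES = FRAME_SAMPLES * BYTES_PER_SAMPLE  # 3840 bytes per frame


async def fake_microphone_chunks(pcm: bytes, realtime: bool = True):
    """Yield 80 ms frames from a PCM buffer (hypothetical test helper)."""
    frame_seconds = FRAME_SAMPLES / SAMPLE_RATE  # 0.08 s per frame
    for offset in range(0, len(pcm), FRAME_BYTES):
        # The final frame may be shorter than FRAME_BYTES.
        yield pcm[offset : offset + FRAME_BYTES]
        if realtime:
            # Sleep one frame duration to imitate a live source.
            await asyncio.sleep(frame_seconds)
```

With realtime=False the frames are yielded back to back, which is convenient in tests.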

Choosing the SDK API

The Python SDK exposes three ways to drive S2S, all over the same WebSocket endpoint, wss://api.gradium.ai/api/speech/s2s. Choose based on how your audio source is shaped.

Use case	Method
You’re producing audio live (microphone, telephony, agent loop) and want output as it arrives	`s2s_realtime`
You can iterate the audio end-to-end (file, in-memory buffer, finite generator) and want to stream output back	`s2s_stream`
You have a complete file and just want the finished output audio plus transcript	`s2s` (buffered)

s2s_realtime takes its setup as keyword arguments; s2s_stream and s2s take a dict. All three produce the same wire-level setup message.
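
As a rough illustration of that shared setup, the sketch below builds the dict once, checks the format names against the values listed under Setup Parameters, and serializes the wire-level message in the shape shown in the Direct WebSocket section. build_setup and to_wire_message are hypothetical helpers, not SDK functions.

```python
import json

# Format names copied from the Setup Parameters list in this guide.
KNOWN_FORMATS = {
    "pcm", "pcm_8000", "pcm_16000", "pcm_22050", "pcm_24000",
    "pcm_44100", "pcm_48000", "wav", "opus",
    "ulaw_8000", "mulaw_8000", "alaw_8000",
}


def build_setup(voice_id, target_language, input_format="pcm",
                output_format="pcm", model_name="default"):
    """Return a setup dict in the shape passed to s2s_stream / s2s here."""
    for name in (input_format, output_format):
        if name not in KNOWN_FORMATS:
            # Fail locally instead of waiting for a server error.
            raise ValueError(f"unknown audio format: {name!r}")
    return {
        "model_name": model_name,
        "input_format": input_format,
        "output_format": output_format,
        "voice_id": voice_id,
        "json_config": {"target_language": target_language},
    }


def to_wire_message(setup):
    """Serialize a setup dict as the first JSON message on the socket."""
    return json.dumps({"type": "setup", **setup})
```

Since the s2s_realtime examples use the same key names as keyword arguments, unpacking the dict with **setup should presumably work there too, though that is an assumption worth checking against the SDK.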

Real-time push: `s2s_realtime`

Use s2s_realtime when audio is being produced live (microphone, telephony, conversational agent) and you need output audio and text as they arrive. It’s an async context manager: push audio with send_audio() and iterate the same object to receive every message type.

import asyncio
import gradium

async def main():
    client = gradium.client.GradiumClient(api_key="your-api-key")

    async with client.s2s_realtime(
        model_name="default",
        input_format="pcm",
        output_format="pcm",
        voice_id="YTpq7expH9539ERJ",  # required; must be a voice in the target language
        json_config={"target_language": "en"},
        wait_for_ready_on_start=True,
    ) as s2s:
        print("Ready:", s2s.ready)

        async def producer():
            async for chunk in microphone_chunks():  # 80 ms PCM frames
                await s2s.send_audio(chunk)
            await s2s.send_eos()

        out = []
        async def consumer():
            async for msg in s2s:
                if msg["type"] == "audio":
                    out.append(msg["audio"])
                elif msg["type"] == "text":
                    print(msg["text"], end=" ", flush=True)
                elif msg["type"] == "end_of_stream":
                    return

        await asyncio.gather(producer(), consumer())

if __name__ == "__main__":
    asyncio.run(main())

send_audio() accepts bytes or a 1-D numpy array (int16 or float32) when input_format is a pcm variant. In the iterator, audio messages already carry decoded bytes in msg["audio"].

Pull-based: `s2s_stream`

Use s2s_stream when the audio is already in hand (a file or a complete in-memory buffer) but you still want to stream the output back as it’s produced. The function consumes your audio iterable and returns an S2SStream.

import asyncio
import gradium

async def main():
    client = gradium.client.GradiumClient(api_key="your-api-key")

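    # Raw 24 kHz, 16-bit mono PCM; "input.pcm" is a placeholder path.
    with open("input.pcm", "rb") as f:
        audio_data = f.read()

    # 1920 samples x 2 bytes per sample = 3840 bytes per 80 ms chunk.
    async def audio_generator(audio_data, chunk_size=1920 * 2):
        for i in range(0, len(audio_data), chunk_size):
            yield audio_data[i : i + chunk_size]

    setup = {
        "model_name": "default",
        "input_format": "pcm",
        "output_format": "wav",
        "voice_id": "YTpq7expH9539ERJ",
        "json_config": {"target_language": "en"},
    }

    stream = await client.s2s_stream(setup, audio_generator(audio_data))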

    with open("output.wav", "wb") as f:
        async for chunk in stream.iter_audio():
            f.write(chunk)

    # Text segments are collected as audio is consumed.
    for twt in stream._text_with_timestamps:
        print(f"[{twt.start_s:.2f}s] {twt.text}")

if __name__ == "__main__":
    asyncio.run(main())

iter_audio() yields output audio bytes and collects text segments along the way. To handle every server message yourself (both audio and text as dicts, with audio fields decoded), use iter_events() instead.

Buffered: `s2s`

Use s2s when you don’t need results incrementally: hand it a complete audio buffer and get back an S2SResult with the full output audio and transcript.

import asyncio
import gradium

async def main():
    client = gradium.client.GradiumClient(api_key="your-api-key")

    setup = {
        "model_name": "default",
        "input_format": "wav",
        "output_format": "wav",
        "voice_id": "YTpq7expH9539ERJ",
        "json_config": {"target_language": "en"},
    }

    with open("speech.wav", "rb") as f:
        result = await client.s2s(setup, f.read())

    print(result.text)
    with open("translated.wav", "wb") as f:
        f.write(result.raw_data)

if __name__ == "__main__":
    asyncio.run(main())

The audio argument can be raw bytes, a 1-D numpy array (int16 or float32; pcm input only, with sample_rate=24000 required), or an async generator of byte chunks. The returned S2SResult exposes raw_data, sample_rate, output_format, request_id, text, and text_with_timestamps; pcm16() / pcm() convert PCM output to numpy arrays.

Setup Parameters

model_name: The S2S model to use (default: "default").
stt_model_name: Speech-to-text model used to transcribe the input. Optional.
tts_model_name: Text-to-speech model used to synthesize the output. Optional.
voice_id (required): Voice UID used for the synthesized output. It must be a voice in the same language as target_language. See Voices.
input_format: Audio format of the input. Supported values:
- "pcm": raw PCM, 24 kHz, 16-bit signed little-endian, mono. Send 1920-sample (80 ms) chunks for best results.
- "pcm_8000", "pcm_16000", "pcm_22050", "pcm_24000", "pcm_44100", "pcm_48000": raw PCM at the indicated sample rate (16-bit signed little-endian, mono).
- "wav": a valid WAV file with PCM data.
- "opus": Ogg-wrapped Opus stream.
- "ulaw_8000" (alias "mulaw_8000"): mu-law encoded PCM at 8 kHz.
- "alaw_8000": A-law encoded PCM at 8 kHz.
output_format: Audio format of the output. Same value set as input_format. For "pcm" the output is 48 kHz, 16-bit signed little-endian, mono.
json_config: Set target_language (e.g. "en") to the language to translate the speech into. The voice_id must be a voice in this language. See Speech-to-Speech Settings.

Unrecognised format values raise a server error.
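
Raw "pcm" output has no header, so most players cannot open it directly. As a minimal sketch, assuming the 48 kHz, 16-bit mono layout described above, the standard-library wave module can wrap the collected bytes in a WAV container. pcm_to_wav is a hypothetical helper, not part of the SDK.

```python
import io
import wave


def pcm_to_wav(pcm: bytes, sample_rate: int = 48_000) -> bytes:
    """Wrap 16-bit signed little-endian mono PCM in a WAV container."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as w:
        w.setnchannels(1)           # mono
        w.setsampwidth(2)           # 16-bit samples
        w.setframerate(sample_rate)
        w.writeframes(pcm)
    return buf.getvalue()
```

For example, pcm_to_wav(b"".join(out)) would package the chunks gathered by the Quickstart consumer. If the ready message reports a different sample_rate, pass that value instead of the default.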

Message Types

When iterating over an s2s_realtime session you receive every message the server sends:

ready: connection accepted; carries sample_rate and frame_size of the output. Available via s2s.ready when you pass wait_for_ready_on_start=True.
text: a translated transcript segment (in target_language) with start_s / stop_s timestamps.
audio: an output audio chunk. In the SDK iterator the audio field is already decoded bytes.
end_of_stream: terminal message after send_eos().
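
One way to handle these four types is a small dispatcher that accumulates audio and transcript segments. The sketch below assumes each message is a dict with a "type" key, that text messages carry text, start_s, and stop_s as top-level keys, and that audio is already decoded bytes as in the SDK iterator. Collected and handle_message are hypothetical names.

```python
from dataclasses import dataclass, field


@dataclass
class Collected:
    audio: bytearray = field(default_factory=bytearray)
    segments: list = field(default_factory=list)  # (start_s, stop_s, text)
    finished: bool = False


def handle_message(state: Collected, msg: dict) -> None:
    """Fold one server message into the running state."""
    kind = msg.get("type")
    if kind == "audio":
        state.audio.extend(msg["audio"])
    elif kind == "text":
        state.segments.append(
            (msg.get("start_s"), msg.get("stop_s"), msg["text"])
        )
    elif kind == "end_of_stream":
        state.finished = True
    # "ready" and any unrecognized types are ignored in this sketch.
```

Inside an `async for msg in s2s` loop, calling handle_message(state, msg) and breaking once state.finished is set would mirror the consumer in the examples above.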

For the wire-level schema (full field types and on-the-wire JSON), see the S2S WebSocket reference.

Direct WebSocket

If you don’t want the gradium SDK (for example, you’re calling from a non-Python runtime or you want full control over the wire), you can talk to the WebSocket protocol directly. The message shapes are identical to what the SDK sends. For interactive poking from a terminal, wscat is the quickest way to confirm reachability and step through messages by hand:

wscat -c "wss://api.gradium.ai/api/speech/s2s" \
  -H "x-api-key: your_api_key"
# After connection, type:
# {"type":"setup","model_name":"default","input_format":"pcm","output_format":"pcm","voice_id":"YTpq7expH9539ERJ","json_config":{"target_language":"en"}}

For a client without the SDK, use any WebSocket library (here Python’s websockets):

import asyncio
import base64
import json

import websockets

CHUNK_BYTES = 1920 * 2  # 80 ms at 24 kHz, 16-bit mono.


async def translate(api_key: str, pcm_audio: bytes):
    setup = {
        "type": "setup",
        "model_name": "default",
        "input_format": "pcm",
        "output_format": "pcm",
        "voice_id": "YTpq7expH9539ERJ",
        "json_config": {"target_language": "en"},
    }

    async with websockets.connect(
        "wss://api.gradium.ai/api/speech/s2s",
        additional_headers={"x-api-key": api_key},
    ) as ws:
        await ws.send(json.dumps(setup))
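        ready = json.loads(await ws.recv())
        # Raise explicitly: assert statements are skipped under python -O.
        if ready["type"] != "ready":
            raise RuntimeError(f"expected 'ready', got: {ready}")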

        async def producer():
            for off in range(0, len(pcm_audio), CHUNK_BYTES):
                chunk = pcm_audio[off : off + CHUNK_BYTES]
                await ws.send(json.dumps({
                    "type": "audio",
                    "audio": base64.b64encode(chunk).decode(),
                }))
            await ws.send(json.dumps({"type": "end_of_stream"}))

        out = []
        async def consumer():
            while True:
                msg = json.loads(await ws.recv())
                if msg["type"] == "audio":
                    out.append(base64.b64decode(msg["audio"]))
                elif msg["type"] == "text":
                    print(msg["text"])
                elif msg["type"] == "end_of_stream":
                    return
                elif msg["type"] == "error":
                    raise RuntimeError(msg["message"])

        await asyncio.gather(producer(), consumer())

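if __name__ == "__main__":
    # Read the input inside a context manager so the file handle is closed.
    with open("input.pcm", "rb") as f:
        pcm_audio = f.read()
    asyncio.run(translate("your_api_key", pcm_audio))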

For the full message schema (every field, every error code), see the S2S WebSocket reference.
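
The base64 framing used by the client above can be factored into two small helpers, which also makes it easy to test without a network connection. This is a sketch based only on the message shapes shown in this guide; encode_audio and decode_message are hypothetical names, and the fallback error text is an assumption.

```python
import base64
import json


def encode_audio(chunk: bytes) -> str:
    """Build the JSON text for one outgoing audio message."""
    return json.dumps({
        "type": "audio",
        "audio": base64.b64encode(chunk).decode("ascii"),
    })


def decode_message(raw) -> dict:
    """Parse a server message, decoding audio and raising on errors."""
    msg = json.loads(raw)
    if msg.get("type") == "error":
        raise RuntimeError(msg.get("message", "unknown server error"))
    if msg.get("type") == "audio":
        # Replace the base64 string with the raw audio bytes.
        msg["audio"] = base64.b64decode(msg["audio"])
    return msg
```

Decoding in one place keeps the consumer loop down to a dispatch on msg["type"].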

Next steps

S2S WebSocket reference

Complete wire-level schema: every message type, every field, every error code.

S2S settings

voice_id, target_language, inner model selection, formats.

Voices

Pick the voice used for the synthesized output.

Errors

Error contracts across REST, WebSocket, and streamed responses.
