One-shot from a finished text block?
For simple “text in, file out” use the REST POST endpoint, a single
HTTP call with no WebSocket to manage.
Quickstart
Open a real-time TTS stream, send text, and read audio chunks as soon as the model produces them.
Choosing the SDK API
The Python SDK exposes three TTS shapes. All WebSocket APIs talk to wss://api.gradium.ai/api/speech/tts.
| Use case | Method |
|---|---|
| You are producing text live and want to send/receive concurrently | tts_realtime |
| You have a string, list, or async generator and want streamed audio chunks | tts_stream |
| You want the complete audio bytes after generation finishes | tts |
Use tts_realtime for voice agents, LLM token streaming, browser
bridges, telephony bridges, or any workflow where the sending and
receiving loops run at the same time. Use tts_stream when your input
is shaped like an iterable and you only need to consume bytes. Use
tts for scripts and tests where waiting for the full result is fine.
Buffered convenience
client.tts(...) still uses the streaming protocol under the hood; it
buffers the audio chunks and returns a single TTSResult.
Setup Parameters
- model_name: The TTS model to use (default: "default")
- voice_id: The id of the voice to use. Voice ids can be found in the voice library section of this documentation or in the studio.
- output_format: Audio format of the output data (supported: "pcm", "wav", "opus", …)
- json_config: Advanced voice settings such as temperature, speed, voice similarity, and rewrite rules. See Voice Settings.
- pronunciation_id: Optional pronunciation dictionary to apply to this request.
- client_req_id and close_ws_on_eos: Used for reusable and multiplexed WebSocket connections. See Multiplexing.
When using the "pcm" output format, the audio adheres to the following
specifications:
- Sample Rate: 48000 Hz (48kHz)
- Format: PCM (Pulse Code Modulation)
- Bit Depth: 16-bit signed integer
- Channels: Single channel (mono)
- Chunk Size: 3840 samples per chunk (80ms at 48kHz)
When using the "wav" output format, the audio chunks are in WAV format,
at 48kHz, 16-bit signed integer mono.
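The "pcm" numbers above translate directly into byte counts: 3840 samples at 2 bytes each is 7680 bytes per chunk, or 80 ms of audio. As a minimal sketch, the standard-library `wave` module can wrap collected raw PCM chunks in a WAV container for playback. The helper names are hypothetical and not part of the gradium SDK, and the `sample_rate_hz` parameter assumes any other PCM variant is also 16-bit mono.

```python
import io
import wave

SAMPLE_RATE_HZ = 48_000   # "pcm" output: 48 kHz
SAMPLE_WIDTH_BYTES = 2    # 16-bit signed integer
CHANNELS = 1              # mono
SAMPLES_PER_CHUNK = 3840  # 80 ms at 48 kHz


def pcm_duration_s(pcm: bytes, sample_rate_hz: int = SAMPLE_RATE_HZ) -> float:
    """Duration in seconds of raw 16-bit mono PCM data."""
    return len(pcm) / (SAMPLE_WIDTH_BYTES * CHANNELS * sample_rate_hz)


def pcm_chunks_to_wav(chunks, sample_rate_hz: int = SAMPLE_RATE_HZ) -> bytes:
    """Wrap an iterable of raw PCM chunks in a single WAV container."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(CHANNELS)
        wav.setsampwidth(SAMPLE_WIDTH_BYTES)
        wav.setframerate(sample_rate_hz)
        for chunk in chunks:
            wav.writeframes(chunk)
    return buf.getvalue()


# Stand-in for received audio: one full chunk of silence.
# 3840 samples * 2 bytes = 7680 bytes, i.e. 80 ms.
silence = bytes(SAMPLES_PER_CHUNK * SAMPLE_WIDTH_BYTES)
wav_bytes = pcm_chunks_to_wav([silence, silence])
```

In a real session the `chunks` argument would be the audio bytes read from the stream rather than silence.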
When using the "opus" output format, the audio chunks use the Opus codec
wrapped in an Ogg container.
Alternative output formats include "ulaw_8000", "alaw_8000", "pcm_8000",
"pcm_16000", and "pcm_24000".
Streaming TTS
Use tts_stream when you have a string, list, or async generator and
want an async iterator of audio bytes.
Using Custom Voices
Output Formats
Flushing and Pauses
The model only generates audio when it has enough context to do so, so generally the audio lags a few words behind the text input. The <flush> tag can be
used to force the model to output the audio for all the text that has been
input so far.
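One possible use of the tag is to flush at sentence boundaries, so that each finished sentence is voiced without waiting for more text. The sketch below is a plain string helper with a hypothetical name. It assumes the tag may be embedded inline, separated from the preceding word by a space; the exact placement rules are not spelled out here, so treat that as an assumption to verify.

```python
import re

FLUSH_TAG = "<flush>"

# Sentence-ending punctuation followed by whitespace or the end of the text.
# A "." inside a number such as 3.14 is not followed by whitespace, so it is skipped.
_SENTENCE_END = re.compile(r"([.!?])(\s+|$)")


def flush_after_sentences(text: str) -> str:
    """Insert a flush tag after each sentence-ending punctuation mark.

    Assumption: the tag is written inline, after a single space.
    """
    return _SENTENCE_END.sub(rf"\1 {FLUSH_TAG}\2", text)


tagged = flush_after_sentences("Hi there. How are you?")
```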
Text with Timestamps
The model returns timestamps for each emitted text segment (typically aligned to word boundaries). Each entry carries text, start_s, and stop_s: the segment text and its start and end
times in seconds within the generated audio. Over a streaming session
the same information is delivered as text messages on the wire (see
the TTS WebSocket reference).
Timestamps are emitted at segment granularity (typically aligned
to word boundaries). For finer-grained character offsets, derive them
from the segment text.
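One way to read "derive them from the segment text" is to map each segment back to its character range in the input string. The sketch below does that with a forward search. It assumes segments are available as dictionaries with the `text`, `start_s`, and `stop_s` keys named above (the SDK may expose them as attributes instead), and the sample values are illustrative rather than real model output.

```python
from typing import NamedTuple


class CharSpan(NamedTuple):
    text: str
    char_start: int  # inclusive offset into the source text
    char_end: int    # exclusive offset into the source text
    start_s: float
    stop_s: float


def align_segments(source_text: str, segments: list[dict]) -> list[CharSpan]:
    """Locate each timestamped segment in the source text, left to right."""
    spans = []
    cursor = 0
    for seg in segments:
        idx = source_text.find(seg["text"], cursor)
        if idx < 0:
            # Segment text does not appear verbatim (for example after a
            # rewrite rule changed it); skip rather than guess.
            continue
        end = idx + len(seg["text"])
        spans.append(CharSpan(seg["text"], idx, end, seg["start_s"], seg["stop_s"]))
        cursor = end
    return spans


# Illustrative data only.
example_segments = [
    {"text": "Hello", "start_s": 0.0, "stop_s": 0.4},
    {"text": "world.", "start_s": 0.4, "stop_s": 0.9},
]
spans = align_segments("Hello world.", example_segments)
```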
Async Generator Input
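When the text source is an LLM, tokens rarely line up with word boundaries, so it can help to re-chunk them before they reach the TTS input. The sketch below is one way to do that under the whitespace rule described in this section. `whole_word_chunks` and `fake_llm_tokens` are hypothetical names, not part of the gradium SDK, and the SDK call that would consume the generator is left out because its exact signature is not shown here.

```python
import asyncio
from typing import AsyncIterator


async def whole_word_chunks(tokens: AsyncIterator[str]) -> AsyncIterator[str]:
    """Re-chunk a token stream so every yielded chunk ends on a whole word."""
    buffer = ""
    async for token in tokens:
        buffer += token
        # Everything before the last whitespace consists of complete words.
        cut = max(buffer.rfind(" "), buffer.rfind("\n"), buffer.rfind("\t"))
        if cut < 0:
            continue  # still inside a word; keep buffering
        ready, buffer = buffer[:cut].strip(), buffer[cut + 1:]
        if ready:
            yield ready
    # Emit whatever is left once the token stream ends.
    if buffer.strip():
        yield buffer.strip()


async def fake_llm_tokens() -> AsyncIterator[str]:
    """Stand-in for an LLM stream that splits words and punctuation."""
    for token in ["Hel", "lo", " wor", "ld", ".", " How", " are", " you", "?"]:
        yield token


async def collect() -> list[str]:
    return [chunk async for chunk in whole_word_chunks(fake_llm_tokens())]


chunks = asyncio.run(collect())
```

Each yielded chunk is a run of complete words with punctuation still attached, so the whitespace the server inserts between messages falls where the original spacing was. The cost is a little latency: a word is held back until the whitespace after it arrives or the stream ends.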
Use an async generator when text is produced incrementally, for example by an LLM or another streaming service.
Split text chunks on whitespace, never in the middle of a word or immediately before a punctuation mark. When text is sent incrementally (whether through successive send_text calls on the Python realtime stream, through an async generator as shown above, or through multiple {"type": "text"} messages over the WebSocket API), the server inserts a single whitespace between the contents of consecutive messages.
So sending "foo" followed by "bar" is equivalent to sending "foo bar" (a whitespace is added between them), not "foobar". Splitting a single word across two messages will change its pronunciation.
For the same reason, do not split punctuation into its own message: sending "foo" followed by "." produces "foo ." rather than "foo.". Keep trailing punctuation attached to the preceding word (e.g. send "foo." as one message, or "foo. " followed by the next chunk).
Pronunciation Dictionaries
Pronunciation dictionaries allow you to customize how specific words or phrases are pronounced in your TTS output. This is particularly useful for:
- Brand names, technical terms, or proper nouns
- Acronyms that should be pronounced in a specific way
- Words with non-standard pronunciations in your use case
To apply a dictionary, pass its id in the pronunciation_id parameter of the setup message, similar to the way we pass the voice_id.
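As a rough sketch of what such a setup message could look like on the wire, the helper below builds a JSON payload from the setup parameters listed earlier. The `"type": "setup"` value is an assumption by analogy with the `{"type": "text"}` messages mentioned on this page, and the ids are placeholders; check the TTS WebSocket reference for the authoritative schema.

```python
import json
from typing import Optional


def build_setup_message(
    voice_id: str,
    output_format: str = "wav",
    model_name: str = "default",
    pronunciation_id: Optional[str] = None,
) -> str:
    """Serialize a setup message; field names follow the Setup Parameters list."""
    message = {
        "type": "setup",  # assumed message type, not confirmed on this page
        "model_name": model_name,
        "voice_id": voice_id,
        "output_format": output_format,
    }
    # Only include the dictionary when one is requested.
    if pronunciation_id is not None:
        message["pronunciation_id"] = pronunciation_id
    return json.dumps(message)


# Placeholder ids; substitute real values from the studio.
setup_json = build_setup_message(
    voice_id="YOUR_VOICE_ID",
    pronunciation_id="YOUR_PRONUNCIATION_ID",
)
```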
Direct WebSocket
If you don’t want the gradium SDK, for example because you’re calling from a non-Python runtime or you want full control over the wire, you can talk to the WebSocket protocol directly. The message shapes are identical to what the SDK sends.
For interactive poking from a terminal, wscat is the quickest way to confirm reachability and step through messages
by hand. The same exchange can be scripted in Python with the websockets library.
Next steps
TTS WebSocket reference
Complete wire-level schema: every message type, every field, every
error code.
One-shot REST
Synthesise a complete text block in a single HTTP request.
Multiplexing
Run multiple concurrent synthesis requests over a single WebSocket
session.
Voice settings
Speed, temperature, voice similarity, and rewrite rules.