Basic Usage
Setup Parameters
- model_name: The TTS model to use (default: "default")
- voice_id: The ID of the voice to use. Voice IDs can be found in the voice library section of this documentation or in the studio.
- output_format: Audio format of the output data (supported: "pcm", "wav", "opus", …)
When using the "pcm" output format, the audio will adhere to the following
specifications:
- Sample Rate: 48000 Hz (48kHz)
- Format: PCM (Pulse Code Modulation)
- Bit Depth: 16-bit signed integer
- Channels: Single channel (mono)
- Chunk Size: 3840 samples per chunk (80ms at 48kHz)
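The chunk figures above are consistent with one another: 3840 samples at 48 kHz is 80 ms, and at 2 bytes per sample one chunk is 7680 bytes. The sketch below checks that arithmetic and wraps raw PCM bytes in a WAV header using Python's standard `wave` module so they can be played back. `pcm_to_wav` is a helper name chosen for illustration and is not part of the API.

```python
import io
import wave

SAMPLE_RATE = 48_000      # Hz, from the spec above
SAMPLE_WIDTH = 2          # bytes per sample (16-bit signed integer)
CHANNELS = 1              # mono
SAMPLES_PER_CHUNK = 3840  # from the spec above

# Derived values: duration and byte size of one chunk.
CHUNK_MS = SAMPLES_PER_CHUNK * 1000 // SAMPLE_RATE
CHUNK_BYTES = SAMPLES_PER_CHUNK * SAMPLE_WIDTH * CHANNELS


def pcm_to_wav(pcm: bytes) -> bytes:
    """Wrap raw 48 kHz, 16-bit mono PCM bytes in a WAV container."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(CHANNELS)
        wav.setsampwidth(SAMPLE_WIDTH)
        wav.setframerate(SAMPLE_RATE)
        wav.writeframes(pcm)
    return buf.getvalue()


if __name__ == "__main__":
    silence = bytes(CHUNK_BYTES)  # one chunk of silence as stand-in audio
    print(CHUNK_MS, CHUNK_BYTES, len(pcm_to_wav(silence)))
```

In practice you would concatenate the received chunks and pass the combined bytes to `pcm_to_wav` once, rather than wrapping each chunk separately.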
When using the "wav" output format, the audio chunks are in WAV format,
at 48kHz, 16-bit signed integer, mono.
When using the "opus" output format, the audio chunks use the Opus codec
wrapped in an Ogg container.
Alternative output formats include "ulaw_8000", "alaw_8000", "pcm_8000",
"pcm_16000", and "pcm_24000".
Streaming TTS
The TTS can be used in a streaming fashion. The first chunks of audio will be available as soon as they are generated.
Using Custom Voices
Output Formats
Flushing and Pauses
The model only generates audio when it has enough context to do so, so the audio generally lags a few words behind the text input. The <flush> tag can be
used to force the model to output the audio for all the text that has been
input so far.
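As a minimal sketch, assuming the tag is embedded inline in the text you send (the actual send call depends on your client and is not shown), the hypothetical helper below appends `<flush>` after each sentence so that audio for that sentence is not held back waiting for further input. The space before the tag is also an assumption.

```python
from typing import Iterable, Iterator

FLUSH_TAG = "<flush>"  # tag named in the text above


def flush_after_each(sentences: Iterable[str]) -> Iterator[str]:
    """Yield each non-empty sentence followed by a flush tag (hypothetical helper)."""
    for sentence in sentences:
        text = sentence.strip()
        if text:  # skip empty or whitespace-only fragments
            yield f"{text} {FLUSH_TAG}"


if __name__ == "__main__":
    for piece in flush_after_each(["Hello there.", "  ", "How can I help?"]):
        print(piece)
```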
Text with Timestamps
The model also returns word-level timestamps for the generated audio.
Async Generator Input
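For async generator input, here is a minimal sketch of the input side, assuming the client accepts any Python async generator that yields text fragments. `text_source` and `collect` are illustrative names. `collect` only stands in for the TTS client, whose actual call is not shown here.

```python
import asyncio
from typing import AsyncIterator


async def text_source(words: list[str]) -> AsyncIterator[str]:
    """Yield text piece by piece, as an upstream token stream might."""
    for word in words:
        await asyncio.sleep(0)  # placeholder for waiting on an upstream producer
        yield word + " "


async def collect(source: AsyncIterator[str]) -> str:
    """Stand-in consumer: a real client would forward each piece to the TTS."""
    pieces = []
    async for piece in source:
        pieces.append(piece)
    return "".join(pieces)


if __name__ == "__main__":
    print(asyncio.run(collect(text_source(["Hello", "from", "a", "stream."]))))
```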
Pronunciation Dictionaries
Pronunciation dictionaries allow you to customize how specific words or phrases are pronounced in your TTS output. This is particularly useful for:
- Brand names, technical terms, or proper nouns
- Acronyms that should be pronounced in a specific way
- Words with non-standard pronunciations in your use case
To use a pronunciation dictionary, pass the pronunciation_id parameter in the setup message, in the same way as the voice_id.
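A setup message carrying both IDs might look like the sketch below. The parameter names `model_name`, `voice_id`, `output_format`, and `pronunciation_id` come from this page. The `"type": "setup"` field, the JSON encoding, and the ID values are assumptions made for illustration only.

```python
import json

# All ID values below are placeholders; substitute IDs from your own account.
setup_message = {
    "type": "setup",                 # assumed message discriminator
    "model_name": "default",
    "voice_id": "YOUR_VOICE_ID",
    "output_format": "wav",
    "pronunciation_id": "YOUR_PRONUNCIATION_ID",
}

payload = json.dumps(setup_message)  # serialized form to send as the first message
print(payload)
```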