
Basic Usage

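import asyncio

import gradium


async def main():
    client = gradium.client.GradiumClient()
    # tts() is awaited, so the call has to run inside a coroutine.
    result = await client.tts(
        setup={
            "model_name": "default",
            "voice_id": "YTpq7expH9539ERJ",
            "output_format": "wav"
        },
        text="Hello, world!"
    )

    # raw_data holds the encoded audio bytes (WAV here).
    with open("output.wav", "wb") as f:
        f.write(result.raw_data)

    print(f"Sample rate: {result.sample_rate}")
    print(f"Request ID: {result.request_id}")


# The later snippets on this page use `await` directly; run them inside a
# coroutine such as main().
asyncio.run(main())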

Setup Parameters

  • model_name: The TTS model to use (default: "default")
  • voice_id: The id of the voice to use. Voice ids can be found in the voice library section of this documentation or in the studio.
  • output_format: The audio format of the output data (supported: "pcm", "wav", "opus", …)
With the "pcm" output format, the audio has the following specification:
  • Sample Rate: 48000 Hz (48kHz)
  • Format: PCM (Pulse Code Modulation)
  • Bit Depth: 16-bit signed integer
  • Channels: Single channel (mono)
  • Chunk Size: 3840 samples per chunk (80ms at 48kHz)
With the "wav" output format, the audio chunks are WAV data at 48kHz, 16-bit signed integer, mono. With the "opus" output format, the audio chunks use the Opus codec wrapped in an Ogg container. Other output formats include "ulaw_8000", "alaw_8000", "pcm_8000", "pcm_16000", and "pcm_24000".
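The chunk figures in the "pcm" specification follow from simple arithmetic. As a quick check, the sketch below derives the chunk duration, chunk size in bytes, and raw data rate from the listed values. The constant names are illustrative and not part of the gradium package, and whether every streamed chunk has exactly this size is not stated here.

```python
# Values taken from the "pcm" specification above.
SAMPLE_RATE_HZ = 48_000
BYTES_PER_SAMPLE = 2        # 16-bit signed integer
CHANNELS = 1                # mono
SAMPLES_PER_CHUNK = 3_840

# Duration of one chunk in milliseconds.
chunk_ms = SAMPLES_PER_CHUNK * 1000 / SAMPLE_RATE_HZ

# Size of one chunk in bytes.
chunk_bytes = SAMPLES_PER_CHUNK * BYTES_PER_SAMPLE * CHANNELS

# Raw data rate in bytes per second.
bytes_per_second = SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * CHANNELS

print(chunk_ms, chunk_bytes, bytes_per_second)  # 80.0 7680 96000
```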

Streaming TTS

The TTS can also be used in streaming mode, where the first audio chunks become available as soon as they are generated.
stream = await client.tts_stream(
    setup={
        "model_name": "default",
        "voice_id": "LFZvm12tW_z0xfGo",
        "output_format": "pcm"
    },
    text="This is a longer text that will be streamed."
)

async for audio_chunk in stream.iter_bytes():
    print(f"Received {len(audio_chunk)} bytes")
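Raw "pcm" chunks carry no header, so most players cannot open them directly. One way to make the streamed audio playable is to wrap it in a WAV container with Python's standard wave module. The sketch below assumes the 48kHz, 16-bit, mono layout described above; collect_pcm_as_wav and fake_stream are hypothetical helpers, and fake_stream only stands in for stream.iter_bytes() so the example runs without the service.

```python
import asyncio
import io
import wave
from typing import AsyncIterator


async def collect_pcm_as_wav(chunks: AsyncIterator[bytes],
                             sample_rate: int = 48_000) -> bytes:
    """Concatenate raw 16-bit mono PCM chunks and wrap them in a WAV container."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav_file:
        wav_file.setnchannels(1)      # mono
        wav_file.setsampwidth(2)      # 16-bit samples
        wav_file.setframerate(sample_rate)
        async for chunk in chunks:
            wav_file.writeframes(chunk)
    return buffer.getvalue()


async def fake_stream() -> AsyncIterator[bytes]:
    """Stand-in for stream.iter_bytes(): three chunks of 80 ms silence."""
    for _ in range(3):
        yield b"\x00\x00" * 3840


wav_bytes = asyncio.run(collect_pcm_as_wav(fake_stream()))
```

With a real stream, the call would be collect_pcm_as_wav(stream.iter_bytes()), and the returned bytes can be written to a .wav file.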

Using Custom Voices

To use a custom voice, pass its id as voice_id in the setup:

result = await client.tts(
    setup={
        "model_name": "default",
        "voice_id": "YTpq7expH9539ERJ",
        "output_format": "wav"
    },
    text="Hello with my custom voice!"
)

Output Formats

# WAV format
result = await client.tts(setup={"voice_id": "YTpq7expH9539ERJ", "output_format": "wav"}, text="Hello")

# PCM format: the data is sampled at 48kHz, 16-bit signed integer, mono
result = await client.tts(setup={"voice_id": "YTpq7expH9539ERJ", "output_format": "pcm"}, text="Hello")

# Get numpy array from PCM
pcm_array = result.pcm()
pcm16_array = result.pcm16()
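
The exact array types returned by pcm() and pcm16() are not described here. For readers who want to work with the raw "pcm" bytes directly, the sketch below decodes 16-bit signed samples by hand using only the standard library. It assumes the samples are little-endian, which this page does not state, and pcm16_bytes_to_floats is a hypothetical helper name.

```python
import array
import sys


def pcm16_bytes_to_floats(raw: bytes) -> list[float]:
    """Decode 16-bit signed PCM (assumed little-endian) into floats in [-1.0, 1.0)."""
    samples = array.array("h")    # "h" = signed 16-bit integer
    samples.frombytes(raw)
    if sys.byteorder == "big":
        samples.byteswap()        # input is assumed little-endian
    return [sample / 32768.0 for sample in samples]


# Three samples: zero, the maximum (32767) and the minimum (-32768).
raw = (
    (0).to_bytes(2, "little", signed=True)
    + (32767).to_bytes(2, "little", signed=True)
    + (-32768).to_bytes(2, "little", signed=True)
)
floats = pcm16_bytes_to_floats(raw)
```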

Flushing and Pauses

The model only generates audio when it has enough context to do so, so generally the audio lags a few words behind the text input. The <flush> tag can be used to force the model to output the audio for all the text that has been input so far.
sample_text = "Hello, this is a test from the Gradium Text to Speech system. <flush> We are testing the flush."
test_audio = await client.tts(
    setup={'voice_id': 'YTpq7expH9539ERJ', 'output_format': 'wav'},
    text=sample_text,
)
Pauses can be generated by inserting a break tag with a time attribute, as shown below. The break time is specified in seconds and should be between 0.1 and 2.0 seconds. The tag must be preceded and followed by a space.
sample_text = 'Hello, this is a test from the Gradium Text to Speech system. <break time="1.5s" /> We are testing the pause.'

test_audio = await client.tts(
    setup={'voice_id': 'YTpq7expH9539ERJ', 'output_format': 'wav'},
    text=sample_text,
)
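
The rules for the break tag (a duration between 0.1 and 2.0 seconds, and a space on each side) are easy to get wrong when text is assembled programmatically. The sketch below is one way to enforce them; break_tag and join_with_pause are hypothetical helpers, not part of the gradium package.

```python
def break_tag(seconds: float) -> str:
    """Return a break tag padded with the required surrounding spaces."""
    seconds = float(seconds)
    if not 0.1 <= seconds <= 2.0:
        raise ValueError(f"break time must be between 0.1 and 2.0 s, got {seconds}")
    return f' <break time="{seconds}s" /> '


def join_with_pause(before: str, after: str, seconds: float) -> str:
    """Join two text fragments with a pause, avoiding doubled spaces."""
    return before.rstrip() + break_tag(seconds) + after.lstrip()


sample_text = join_with_pause("Hello, this is a test.", "We are testing the pause.", 1.5)
print(sample_text)
# Hello, this is a test. <break time="1.5s" /> We are testing the pause.
```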

Text with Timestamps

The model also returns word-level timestamps for the generated audio.
result = await client.tts(
    setup={"voice_id": "YTpq7expH9539ERJ", "output_format": "wav"},
    text="Hello, world!"
)

for item in result.text_with_timestamps:
    print(f"{item.text}: {item.start_s:.2f}s - {item.stop_s:.2f}s")
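
Word-level timestamps make it possible to look up what is being spoken at a given playback position, for example to highlight captions. The sketch below shows one such lookup. TimedWord is a hypothetical stand-in that mirrors the text, start_s, and stop_s fields used above, and the timestamp values are invented for illustration.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class TimedWord:
    """Hypothetical stand-in with the same fields as the items above."""
    text: str
    start_s: float
    stop_s: float


def word_at(items: Iterable[TimedWord], t: float) -> Optional[str]:
    """Return the text being spoken at time t (seconds), or None during silence."""
    for item in items:
        if item.start_s <= t < item.stop_s:
            return item.text
    return None


# Invented timestamps, for illustration only.
words = [TimedWord("Hello,", 0.10, 0.45), TimedWord("world!", 0.50, 0.95)]
print(word_at(words, 0.30))  # Hello,
print(word_at(words, 0.47))  # None
```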

Async Generator Input

The text argument can also be an async generator that yields the text piece by piece:

async def text_generator():
    yield "Hello, "
    yield "this is "
    yield "a streaming "
    yield "example."

stream = await client.tts_stream(
    setup={"voice_id": "YTpq7expH9539ERJ", "output_format": "pcm"},
    text=text_generator()
)

async for chunk in stream.iter_bytes():
    pass
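
When the text fragments come from an ordinary (synchronous) iterable, they can be adapted into an async generator. The sketch below is a minimal adapter with a stand-in consumer so it runs without the service. as_async_text and collect are hypothetical helpers, and passing the adapter as text= assumes tts_stream accepts any async generator of strings, as in the example above.

```python
import asyncio
from typing import AsyncIterator, Iterable


async def as_async_text(parts: Iterable[str]) -> AsyncIterator[str]:
    """Yield text fragments one at a time, letting other tasks run in between."""
    for part in parts:
        yield part
        await asyncio.sleep(0)  # hand control back to the event loop


async def collect(source: AsyncIterator[str]) -> str:
    """Stand-in consumer: join everything the generator yields."""
    return "".join([part async for part in source])


parts = ["Hello, ", "this is ", "a streaming ", "example."]
text = asyncio.run(collect(as_async_text(parts)))
print(text)  # Hello, this is a streaming example.
```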

Pronunciation Dictionaries

Pronunciation dictionaries allow you to customize how specific words or phrases are pronounced in your TTS output. This is particularly useful for:
  • Brand names, technical terms, or proper nouns
  • Acronyms that should be pronounced in a specific way
  • Words with non-standard pronunciations in your use case
The easiest way to create and manage pronunciation dictionaries is through the Gradium Studio, on the pronunciation page. Once you have created a dictionary and obtained its ID, pass that ID as the pronunciation_id parameter in the setup message, in the same way as the voice_id:
import gradium

client = gradium.client.GradiumClient()

result = await client.tts(
    setup={
        "voice_id": "YTpq7expH9539ERJ",
        "output_format": "wav",
        "pronunciation_id": "bb1ckYhNHCcIJjdK",  # Whatever your ID is.
    },
    text="The text you want to generate."
)

with open("output.wav", "wb") as f:
    f.write(result.raw_data)
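
If only some requests use a pronunciation dictionary, the setup dict can be built conditionally. The sketch below uses only the setup keys shown on this page; make_setup is a hypothetical helper, not part of the gradium package.

```python
from typing import Optional


def make_setup(voice_id: str, output_format: str = "wav",
               pronunciation_id: Optional[str] = None) -> dict:
    """Build a setup dict, adding pronunciation_id only when one is given."""
    setup = {"voice_id": voice_id, "output_format": output_format}
    if pronunciation_id is not None:
        setup["pronunciation_id"] = pronunciation_id
    return setup


plain = make_setup("YTpq7expH9539ERJ")
custom = make_setup("YTpq7expH9539ERJ", pronunciation_id="bb1ckYhNHCcIJjdK")
```

Either dict can then be passed as setup= to client.tts, as in the example above.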