FAQ

What languages do you support?

We currently support English, French, Spanish, Portuguese, and German, with more languages in development. Sign up to be notified when they become available!

What is the maximum session duration?

A session can last up to 300 seconds. To generate longer text or transcribe longer audio, split the work across multiple sessions. On the free tier, each session is additionally limited to 1,500 characters.
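As a rough sketch of how that splitting might look on the client side, the Python below breaks a long text into chunks that fit the free-tier character limit and computes time windows for a long audio file. The function names `split_text` and `audio_segments` are hypothetical helpers, not part of any Gradium SDK, and preferring sentence boundaries is a design assumption rather than a requirement.

```python
import re

FREE_TIER_MAX_CHARS = 1500   # free-tier character limit per session
MAX_SESSION_SECONDS = 300    # maximum session duration


def split_text(text: str, max_chars: int = FREE_TIER_MAX_CHARS) -> list[str]:
    """Split text into chunks of at most max_chars, preferring sentence boundaries."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        # Hard-split any single sentence that is longer than the limit.
        while len(sentence) > max_chars:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(sentence[:max_chars])
            sentence = sentence[max_chars:]
        candidate = f"{current} {sentence}".strip()
        if len(candidate) <= max_chars:
            current = candidate
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks


def audio_segments(
    total_seconds: float, max_seconds: float = MAX_SESSION_SECONDS
) -> list[tuple[float, float]]:
    """Return (start, end) windows in seconds, each no longer than max_seconds."""
    segments: list[tuple[float, float]] = []
    start = 0.0
    while start < total_seconds:
        end = min(start + max_seconds, float(total_seconds))
        segments.append((start, end))
        start = end
    return segments
```

Each chunk or window would then be sent in its own session. Cutting audio at fixed offsets can split a word in half, so in practice you may prefer to cut at silences.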

How many credits do I need?

For Text-to-Speech, one minute of audio typically requires 750 characters, which corresponds to 750 credits. This may vary depending on the language and the speaker’s pace. For Speech-to-Text, each second of audio costs 3 credits.
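The arithmetic above can be sketched as a small estimator. This is an illustration only: the one-credit-per-character rate is inferred from "750 characters, which corresponds to 750 credits", and both the treatment of whitespace and the rounding of partial seconds are assumptions, so actual billing may differ. The function names are hypothetical.

```python
import math

TTS_CREDITS_PER_CHAR = 1      # inferred: 750 characters correspond to 750 credits
TTS_CHARS_PER_MINUTE = 750    # typical value; varies with language and speaking pace
STT_CREDITS_PER_SECOND = 3    # each second of audio costs 3 credits


def tts_credits(text: str) -> int:
    """Estimate Text-to-Speech credits, counting every character (assumption)."""
    return len(text) * TTS_CREDITS_PER_CHAR


def tts_estimated_minutes(text: str) -> float:
    """Estimate the audio length in minutes at the typical 750 characters per minute."""
    return len(text) / TTS_CHARS_PER_MINUTE


def stt_credits(audio_seconds: float) -> int:
    """Estimate Speech-to-Text credits, rounding partial seconds up (assumption)."""
    return math.ceil(audio_seconds) * STT_CREDITS_PER_SECOND
```

For example, `tts_credits("a" * 750)` gives 750 and `stt_credits(60)` gives 180, so one minute of transcription costs less than a quarter of what one typical minute of synthesis does.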

Can I use my own voice?

Yes! We offer two levels of voice cloning. Our standard feature, the Instant Voice Clone, creates a high-quality clone from just a 10-second audio sample. For those seeking even higher fidelity, we now offer Pro Voice Clones: by providing 30 minutes to a few hours of audio, you can create a custom voice with unmatched quality and nuance.

What is an Instant Voice Clone?

Instant voice cloning lets you create a realistic digital replica of a voice from just a few seconds of reference audio (typically 10 seconds). Depending on your subscription plan, you can clone up to 1,000 voices. Please note that explicit consent from the voice owner is required.

What is a Pro Voice Clone?

A Pro Voice Clone is a high-fidelity, hyper-realistic voice model created by fine-tuning our dedicated AI on a large dataset of your audio. Unlike standard cloning, this process captures the speaker’s deepest emotional nuances, unique accents, and natural pacing, resulting in a digital voice indistinguishable from the original.

How do I generate a Pro Voice Clone?

To get started, navigate to the Pro Voice Clone tab in Gradium Studio and upload your audio dataset. You will receive a notification once the upload is processed. After training is complete, the voice will appear in your library and be ready for Text-to-Speech (TTS) generation.

How much audio do I need for a voice clone?

For an Instant Voice Clone, 10 seconds of audio is enough for optimal results. For a Pro Voice Clone, we require a minimum of 30 minutes of clean audio. To capture the speaker’s full emotional range and produce a stable voice, we recommend providing 2 hours of audio.
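Before uploading, it can help to check a dataset's total duration against these thresholds. The sketch below is a hypothetical local helper, not a Gradium API. It treats 10 seconds as a recommendation for Instant Voice Clones (no hard minimum is stated for them) and 30 minutes as a hard minimum for Pro Voice Clones.

```python
INSTANT_RECOMMENDED_SECONDS = 10        # typical sample length for an Instant Voice Clone
PRO_MINIMUM_SECONDS = 30 * 60           # required minimum for a Pro Voice Clone
PRO_RECOMMENDED_SECONDS = 2 * 60 * 60   # recommended amount for a Pro Voice Clone


def check_clone_audio(kind: str, seconds: float) -> str:
    """Classify an audio duration for a clone type ("instant" or "pro")."""
    if kind == "instant":
        if seconds >= INSTANT_RECOMMENDED_SECONDS:
            return "recommended"
        return "below_recommended"
    if kind == "pro":
        if seconds < PRO_MINIMUM_SECONDS:
            return "below_minimum"
        if seconds < PRO_RECOMMENDED_SECONDS:
            return "accepted"
        return "recommended"
    raise ValueError(f"unknown clone type: {kind!r}")
```

For instance, `check_clone_audio("pro", 45 * 60)` returns `"accepted"`: the dataset meets the minimum but falls short of the recommended 2 hours.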

How do I enable Zero Data Retention?

Zero Data Retention is available on our paid plans. To enable it, sign in to Gradium Studio, open the Profile dropdown in the top right, and select My organization. Under Privacy Settings, disable data retention. Once turned off, Gradium will no longer retain your request or response payloads.
