We currently support English, French, Spanish, Portuguese, and German, with more
languages in development. Sign up to be notified when more languages
become available!
A session can last up to 300 seconds. If you want to generate longer chunks of
text or transcribe longer audio, it’s better to split the input across several
sessions. When using the free tier, there is an additional limit of 1500 characters
per session.
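As an illustration of splitting long text across sessions, here is a minimal Python sketch that cuts text into chunks of at most 1500 characters, breaking on whitespace. The function name `split_for_sessions` is hypothetical and not part of any Gradium SDK; the sketch only handles the free-tier character limit and does not account for the 300-second session cap.

```python
def split_for_sessions(text: str, max_chars: int = 1500) -> list[str]:
    """Split text into chunks of at most max_chars characters.

    Hypothetical helper, not a Gradium API. Breaks on whitespace where
    possible; runs of whitespace are collapsed to single spaces.
    """
    if max_chars <= 0:
        raise ValueError("max_chars must be positive")

    chunks: list[str] = []
    current = ""
    for word in text.split():
        # A single word longer than the limit has to be hard-split.
        while len(word) > max_chars:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(word[:max_chars])
            word = word[max_chars:]

        candidate = word if not current else current + " " + word
        if len(candidate) <= max_chars:
            current = candidate
        else:
            # Adding this word would exceed the limit: close the chunk.
            chunks.append(current)
            current = word

    if current:
        chunks.append(current)
    return chunks
```

Each returned chunk could then be sent in its own session. Splitting on sentence boundaries instead of arbitrary whitespace would likely give more natural-sounding audio at the seams, at the cost of slightly more code.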
For Text-to-Speech, one minute of audio typically requires 750 characters,
which corresponds to 750 credits. This may vary depending on the language and
the speaker’s pace. For Speech-to-Text, each second of audio costs 3 credits.
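The arithmetic above can be sketched in a few lines of Python. The constants come from the figures quoted here (750 characters per minute, 750 credits for 750 characters, 3 credits per second); the function names are hypothetical, and how fractional seconds are rounded for billing is not specified, so the sketch leaves values unrounded.

```python
# Figures taken from the text above; actual usage may vary with
# language and speaking pace.
TTS_CREDITS_PER_CHAR = 1        # 750 characters correspond to 750 credits
TTS_CHARS_PER_MINUTE = 750      # typical, not guaranteed
STT_CREDITS_PER_SECOND = 3


def tts_credits(text: str) -> int:
    """Estimated credits to synthesize `text` (one credit per character)."""
    return len(text) * TTS_CREDITS_PER_CHAR


def tts_estimated_seconds(num_chars: int) -> float:
    """Rough audio duration for a given character count."""
    return num_chars / TTS_CHARS_PER_MINUTE * 60


def stt_credits(audio_seconds: float) -> float:
    """Estimated credits to transcribe audio of the given length.

    Rounding of fractional seconds is an assumption left open here.
    """
    return audio_seconds * STT_CREDITS_PER_SECOND
```

For example, a 1500-character script would cost about 1500 credits and produce roughly two minutes of audio, while transcribing one minute of audio would cost about 180 credits.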
Yes! We offer two levels of voice cloning. Our standard feature allows you to
create a high-quality clone with just a 10-second audio sample. For those
seeking even higher fidelity, we now offer Pro Voice Clones; by providing 30
minutes to a few hours of audio data, you can create a custom voice with
unmatched quality and nuance.
Instant voice cloning enables you to create realistic digital replicas using just
a few seconds of reference audio (typically 10 seconds). Depending on your
subscription plan, you can clone up to 1,000 voices. Please note that explicit
consent from the voice owner is required.
A Pro Voice Clone is a high-fidelity, hyper-realistic voice model created by
fine-tuning our dedicated AI on a large dataset of your audio. Unlike standard
cloning, this process captures the speaker’s deepest emotional nuances, unique
accents, and natural pacing, resulting in a digital voice indistinguishable from
the original.
To get started, navigate to the Pro Voice Clone tab in Gradium Studio and upload
your audio dataset. You will receive a notification once the upload is processed. After
training is complete, the voice will appear in your library and be ready for
Text-to-Speech (TTS) generation.
For an Instant Voice Clone, we get optimal results with only 10 seconds of audio. For a
Pro Voice Clone, we require a minimum of 30 minutes of clean audio. For the best results,
where the voice captures the full emotional range and remains stable, we recommend
providing 2 hours of audio.
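If you want to check a recording's length before uploading, the thresholds above can be encoded in a small helper. This is only a sketch: the function name, the `kind` values, and the returned labels are hypothetical and do not correspond to any Gradium API.

```python
INSTANT_TYPICAL_S = 10            # typical sample length for an instant clone
PRO_MINIMUM_S = 30 * 60           # 30 minutes required for a Pro Voice Clone
PRO_RECOMMENDED_S = 2 * 60 * 60   # 2 hours recommended for best results


def check_clone_audio(kind: str, duration_s: float) -> str:
    """Classify an audio duration against the documented thresholds.

    `kind` is "instant" or "pro" (labels chosen for this sketch).
    """
    if kind == "instant":
        return "ok" if duration_s >= INSTANT_TYPICAL_S else "below_typical"
    if kind == "pro":
        if duration_s < PRO_MINIMUM_S:
            return "too_short"
        if duration_s < PRO_RECOMMENDED_S:
            return "meets_minimum"
        return "recommended"
    raise ValueError(f"unknown clone kind: {kind!r}")
```

A 45-minute dataset, for instance, would be classified as meeting the Pro minimum but falling short of the recommended 2 hours.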