Gradium STT emits
step messages every 80 ms. Each step contains
semantic VAD predictions: the probability that the speaker will be
inactive at several future horizons. Use those probabilities together
with delay_in_frames and flush to build turn-taking that feels
responsive without cutting people off.
Use these signals to decide when an agent should respond.
Basic Rule
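Start with the longest horizon and a threshold around 0.5: treat the turn as over once that horizon's inactivity probability crosses the threshold. Then tune for the product you are building: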
| Product feel | Suggested tuning |
|---|---|
| Fast assistant | Shorter horizon or lower threshold. |
| Careful assistant | Longer horizon or higher threshold. |
| Noisy telephony | Require several consecutive high-confidence steps. |
| Dictation | Prefer explicit user controls or longer silence. |
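The rule and the tuning knobs in the table can be sketched as a small detector. This is a minimal illustration, not the Gradium SDK: the `TurnDetector` class is hypothetical, and it assumes each step's predictions arrive as a list of inactivity probabilities ordered from the shortest to the longest horizon.

```python
from dataclasses import dataclass, field


@dataclass
class TurnDetector:
    """Hypothetical helper: signals end of turn from per-step VAD probabilities."""

    threshold: float = 0.5    # starting point from the basic rule
    horizon_index: int = -1   # -1 picks the longest horizon (assumes ascending order)
    required_steps: int = 1   # raise this for noisy telephony
    _streak: int = field(default=0, init=False, repr=False)

    def update(self, inactivity_probs: list[float]) -> bool:
        """Feed one step's probabilities; return True when the turn looks over."""
        prob = inactivity_probs[self.horizon_index]
        # Count consecutive steps at or above the threshold; reset on any dip.
        self._streak = self._streak + 1 if prob >= self.threshold else 0
        return self._streak >= self.required_steps


# A "noisy telephony" style configuration: three confident steps in a row.
detector = TurnDetector(threshold=0.6, required_steps=3)
for probs in [[0.1, 0.3], [0.4, 0.7], [0.5, 0.8], [0.6, 0.9]]:
    if detector.update(probs):
        print("end of turn")
```

Lowering `threshold` or choosing an earlier `horizon_index` makes the detector faster; raising `required_steps` trades about 80 ms of extra waiting per step for fewer false triggers.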
Full Loop
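A full turn-taking loop might look like the sketch below. It is an illustration under stated assumptions rather than the actual client API: the `SttStream` interface, the `text` message type, and the `inactivity_probs` field are hypothetical names, and `FakeStream` is an in-memory stand-in so the example runs without a server. Only the `step` and `flushed` message types and the `send_flush()` call come from this page.

```python
import asyncio
from collections import deque
from typing import AsyncIterator, Protocol


class SttStream(Protocol):
    """Hypothetical client interface; adapt the names to your SDK."""

    def messages(self) -> AsyncIterator[dict]: ...
    async def send_flush(self) -> None: ...


async def run_turn(stream: SttStream, threshold: float = 0.5) -> str:
    """Collect text until the VAD signals end of turn, flush, then return the transcript."""
    words: list[str] = []
    flush_sent = False
    async for msg in stream.messages():
        kind = msg.get("type")
        if kind == "text":  # assumed message type for transcribed text
            words.append(msg["text"])
        elif kind == "step" and not flush_sent:
            # Assumed shape: list of probabilities, longest horizon last.
            if msg["inactivity_probs"][-1] >= threshold:
                await stream.send_flush()
                flush_sent = True
        elif kind == "flushed":
            break  # the turn is ready for the next pipeline stage
    return " ".join(words)


class FakeStream:
    """In-memory stand-in that replays a script and answers a flush."""

    def __init__(self, script: list[dict]) -> None:
        self._pending = deque(script)

    async def messages(self) -> AsyncIterator[dict]:
        while self._pending:
            yield self._pending.popleft()

    async def send_flush(self) -> None:
        # Simulate the server emitting the remaining text, then `flushed`.
        self._pending.append({"type": "text", "text": "world"})
        self._pending.append({"type": "flushed"})


script = [
    {"type": "text", "text": "hello"},
    {"type": "step", "inactivity_probs": [0.1, 0.2, 0.3]},
    {"type": "step", "inactivity_probs": [0.4, 0.6, 0.8]},
]
print(asyncio.run(run_turn(FakeStream(script))))
```

The loop keeps reading after it sends the flush, because text for audio that was still in flight can arrive before the `flushed` message does.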
Adaptive Delay
delay_in_frames controls how much context the STT model uses before
emitting text. Each frame is 80 ms. Larger values can improve text
quality but delay output. Smaller values are more reactive but may be
less stable. Supported values are 7, 8, 10, 12, 14, 16,
20, 24, 32, 36, and 48.
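The arithmetic is simple enough to sketch: at 80 ms per frame, the supported values span 560 ms to 3840 ms of context. The helper names below are hypothetical, and the figures cover only the delay introduced by delay_in_frames, not network or downstream latency.

```python
FRAME_MS = 80
SUPPORTED_DELAYS = (7, 8, 10, 12, 14, 16, 20, 24, 32, 36, 48)


def delay_ms(delay_in_frames: int) -> int:
    """Convert a supported delay_in_frames value to milliseconds."""
    if delay_in_frames not in SUPPORTED_DELAYS:
        raise ValueError(f"unsupported delay_in_frames: {delay_in_frames}")
    return delay_in_frames * FRAME_MS


def largest_delay_within(budget_ms: int) -> int:
    """Pick the largest supported value whose delay fits a latency budget."""
    fitting = [d for d in SUPPORTED_DELAYS if d * FRAME_MS <= budget_ms]
    if not fitting:
        raise ValueError(f"no supported delay fits within {budget_ms} ms")
    return max(fitting)


print(delay_ms(7), delay_ms(48))
print(largest_delay_within(1000))
```

Choosing the largest value that fits the budget favors text quality; a more reactive agent would pick a smaller value instead.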
When the app decides a turn has ended, call send_flush(). It asks the server
to process outstanding audio, and the server then replies with a matching
flushed message. Treat flushed as the point where the current turn is ready
for the next stage of your agent pipeline.
Related
Speech-to-Text WebSocket
STT message types, VAD details, and flushing.
Transcription Settings
Tune language, temp, padding_bonus, and delay_in_frames.