Turn-Taking with Semantic VAD

Gradium STT emits a step message every 80 ms. Each step carries semantic VAD predictions: the probability that the speaker will be inactive at several future horizons. Combine those probabilities with delay_in_frames and flush to decide when an agent should respond, so that turn-taking feels responsive without cutting people off.

Basic Rule

Start with the longest horizon and a threshold around 0.5:

def turn_has_probably_ended(msg):
    if msg["type"] != "step" or not msg["vad"]:
        return False
    horizon = msg["vad"][-1]
    return horizon["inactivity_prob"] > 0.5

This is a starting point, not a universal rule. Tune per application:

Product feel	Suggested tuning
Fast assistant	Shorter horizon or lower threshold.
Careful assistant	Longer horizon or higher threshold.
Noisy telephony	Require several consecutive high-confidence steps.
Dictation	Prefer explicit user controls or longer silence.
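
The "several consecutive high-confidence steps" tuning from the table can be sketched as a small debouncer. This is a minimal illustration, not part of the Gradium SDK: the class name ConsecutiveVadGate and its parameters are hypothetical, and it assumes only the message fields used in the basic rule ("type", "vad", "inactivity_prob").

```python
class ConsecutiveVadGate:
    """Fires once when `required` consecutive steps exceed `threshold`.

    Hypothetical helper: it only reads msg["type"], msg["vad"] and
    "inactivity_prob", the same fields the basic rule uses.
    """

    def __init__(self, threshold=0.5, required=3, horizon_index=-1):
        self.threshold = threshold
        self.required = required
        self.horizon_index = horizon_index  # -1 selects the last horizon
        self._streak = 0
        self._fired = False

    def update(self, msg):
        """Return True exactly once per run of high-probability steps."""
        if msg.get("type") != "step" or not msg.get("vad"):
            return False
        prob = msg["vad"][self.horizon_index]["inactivity_prob"]
        if prob > self.threshold:
            self._streak += 1
        else:
            # Speech resumed (or confidence dropped): re-arm the gate.
            self._streak = 0
            self._fired = False
        if self._streak >= self.required and not self._fired:
            self._fired = True
            return True
        return False


def make_step(prob):
    """Build a step message with a single horizon, for local experiments."""
    return {"type": "step", "vad": [{"inactivity_prob": prob}]}
```

A higher `required` or `threshold` trades responsiveness for robustness, which is the direction the table suggests for noisy telephony; a fast assistant could use `required=1`.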

Full Loop

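import asyncio
import gradium


async def handle_user_turn(text):
    # Placeholder: hand the finished turn to your agent pipeline.
    print(f"user turn: {text}")


async def transcribe_turns(audio_source):
    client = gradium.client.GradiumClient(api_key="your-api-key")

    async with client.stt_realtime(
        model_name="default",
        input_format="pcm",
        json_config={"language": "en", "delay_in_frames": 16},
    ) as stt:
        transcript = []
        high_vad_steps = 0
        flush_pending = False  # True between send_flush() and "flushed"
        flush_id = 0

        async def producer():
            # Stream audio chunks, then signal the end of the stream.
            async for chunk in audio_source:
                await stt.send_audio(chunk)
            await stt.send_eos()

        async def consumer():
            nonlocal high_vad_steps, transcript, flush_pending, flush_id

            async for msg in stt:
                if msg["type"] == "text":
                    transcript.append(msg["text"])

                elif msg["type"] == "step":
                    if not msg["vad"]:
                        continue
                    inactivity = msg["vad"][-1]["inactivity_prob"]
                    high_vad_steps = high_vad_steps + 1 if inactivity > 0.5 else 0

                    # Flush once per turn, not on every later high-probability step.
                    if high_vad_steps >= 3 and transcript and not flush_pending:
                        flush_pending = True
                        flush_id += 1
                        await stt.send_flush(flush_id=flush_id)

                elif msg["type"] == "flushed":
                    text = " ".join(transcript).strip()
                    transcript = []
                    high_vad_steps = 0
                    flush_pending = False
                    await handle_user_turn(text)

                elif msg["type"] == "end_of_stream":
                    # Deliver any text that never reached a flush.
                    if transcript:
                        await handle_user_turn(" ".join(transcript).strip())
                    return

        await asyncio.gather(producer(), consumer())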
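
The turn-taking decisions in the loop above can also be separated from the network code, so thresholds can be tried against recorded messages without a live connection. The sketch below is one way to do that. TurnCollector and its feed() return values are hypothetical names introduced here, and it assumes the same message types as the loop ("text", "step", "flushed").

```python
class TurnCollector:
    """Pure state machine mirroring the consumer logic of the full loop.

    Hypothetical helper for offline testing; it performs no I/O.
    feed() returns ("flush", None) when the caller should send a flush,
    ("turn", text) when a turn is complete, and None otherwise.
    """

    def __init__(self, threshold=0.5, required_steps=3):
        self.threshold = threshold
        self.required_steps = required_steps
        self.words = []
        self.streak = 0
        self.flush_pending = False

    def feed(self, msg):
        kind = msg.get("type")
        if kind == "text":
            self.words.append(msg["text"])
        elif kind == "step" and msg.get("vad"):
            prob = msg["vad"][-1]["inactivity_prob"]
            self.streak = self.streak + 1 if prob > self.threshold else 0
            ready = self.streak >= self.required_steps and self.words
            if ready and not self.flush_pending:
                # Request a single flush and wait for "flushed".
                self.flush_pending = True
                return ("flush", None)
        elif kind == "flushed":
            text = " ".join(self.words).strip()
            self.words = []
            self.streak = 0
            self.flush_pending = False
            return ("turn", text)
        return None


def replay(messages, collector):
    """Feed recorded messages and collect the non-empty actions."""
    actions = []
    for msg in messages:
        action = collector.feed(msg)
        if action is not None:
            actions.append(action)
    return actions
```

In a live consumer, a ("flush", None) result would translate to a send_flush() call and a ("turn", text) result to a call to your turn handler.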

Adaptive Delay

delay_in_frames controls how much audio context the STT model waits for before emitting text. Each frame is 80 ms, so delay_in_frames: 16 corresponds to 1.28 s of context. Larger values can improve text quality but delay output; smaller values are more reactive but may be less stable. Supported values are 7, 8, 10, 12, 14, 16, 20, 24, 32, 36, and 48.

When the app decides a turn has ended, send_flush() asks the server to process outstanding audio, and the server then replies with a matching flushed message. Treat flushed as the point where the current turn is ready for the next stage of your agent pipeline.
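
The frame arithmetic in this section can be made concrete with a small helper. This is a sketch under the assumption that the nominal delay is simply delay_in_frames times 80 ms; the function names are hypothetical and not part of the Gradium SDK.

```python
FRAME_MS = 80  # each frame is 80 ms, as stated in this section
SUPPORTED_DELAYS = (7, 8, 10, 12, 14, 16, 20, 24, 32, 36, 48)


def delay_ms(frames):
    """Nominal delay in milliseconds for a supported delay_in_frames value."""
    if frames not in SUPPORTED_DELAYS:
        raise ValueError(f"unsupported delay_in_frames: {frames}")
    return frames * FRAME_MS


def nearest_supported_delay(target_ms):
    """Pick the supported value closest to a latency budget.

    Ties go to the smaller (more reactive) value.
    """
    return min(SUPPORTED_DELAYS, key=lambda f: (abs(f * FRAME_MS - target_ms), f))
```

For example, a budget of about one second maps to 12 frames (960 ms), while the 16 frames used in the full loop correspond to 1280 ms.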

Related

Speech-to-Text WebSocket

STT message types, VAD details, and flushing.

Transcription Settings

Tune language, temp, padding_bonus, and delay_in_frames.
