STT models accept additional configuration via the json_config parameter. In the Python SDK, this is a dict mapping each option name to its value (a number or a string). When using the REST endpoint, pass it as a URL-encoded JSON string in the json_config query parameter. These options apply to both the WebSocket and REST transports. For TTS settings, see Voice Settings.

Quick reference

| Option | Type | Allowed values | Effect |
| --- | --- | --- | --- |
| temp | float | 0.0 to 1.0 | Sampling temperature for text generation. 0.0 is greedy; higher values produce more diverse output and can help when no text is being recognised. |
| language | string | "en", "fr", "de", "es", "pt" | Expected language of the audio. Grounds the model to a single language for better transcription quality. For mixed audio, set the main language. Any other value raises a server error. |
| padding_bonus | float | -4.0 to 4.0 | Biases the model toward emitting text sooner (negative) or later (positive). Out-of-range values raise a server error. |
| delay_in_frames | int | 7, 8, 10, 12, 14, 16, 20, 24, 32, 36, 48 | Adaptive delay control. Each frame is 80 ms of context before text is emitted. Higher values improve quality at the cost of latency; lower values are more reactive. Other values raise a server error. The legacy alias delay_in_tokens is also accepted. |

Default values may evolve with new model releases. Pin the options explicitly if you depend on a specific behaviour. The server validates known keys with the constraints above; unknown keys are silently ignored, so double-check spelling against the table.
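
Because unknown keys are silently ignored, a misspelled option can go unnoticed. The following is a minimal client-side check, sketched from the table above; it is not part of the SDK. The function name check_json_config is hypothetical, and the sketch assumes the numeric ranges are inclusive and that integers are acceptable where a float is expected.

```python
# Hypothetical client-side pre-check mirroring the quick-reference table.
# Assumptions: ranges are inclusive; ints are accepted for float options.
ALLOWED_LANGUAGES = {"en", "fr", "de", "es", "pt"}
ALLOWED_DELAYS = {7, 8, 10, 12, 14, 16, 20, 24, 32, 36, 48}


def _is_number(value) -> bool:
    # bool is a subclass of int in Python, so exclude it explicitly.
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def check_json_config(config: dict) -> list[str]:
    """Return a list of problems; an empty list means nothing was flagged."""
    problems = []
    for key, value in config.items():
        if key == "temp":
            if not (_is_number(value) and 0.0 <= value <= 1.0):
                problems.append(f"temp must be between 0.0 and 1.0, got {value!r}")
        elif key == "language":
            if not (isinstance(value, str) and value in ALLOWED_LANGUAGES):
                problems.append(f"language {value!r} is not in the allowed set")
        elif key == "padding_bonus":
            if not (_is_number(value) and -4.0 <= value <= 4.0):
                problems.append(f"padding_bonus must be between -4.0 and 4.0, got {value!r}")
        elif key in ("delay_in_frames", "delay_in_tokens"):
            is_int = isinstance(value, int) and not isinstance(value, bool)
            if not (is_int and value in ALLOWED_DELAYS):
                problems.append(f"{key} {value!r} is not an allowed delay")
        else:
            # The server would accept this key without complaint and ignore it.
            problems.append(f"unknown key {key!r} (check spelling)")
    return problems
```

Calling check_json_config({"langauge": "en"}) returns a single "unknown key" problem, which the server itself would not report.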

Passing json_config

The same json_config payload is sent regardless of which SDK API you use; only the call shape differs:
config = {"language": "en", "temp": 0.3, "delay_in_frames": 16}

# Pull-based: setup is a dict containing json_config.
stream = await client.stt_stream(
    {"model_name": "default", "input_format": "pcm", "json_config": config},
    audio_generator(audio_data),
)

# Real-time: setup is keyword arguments; json_config stays nested.
async with client.stt_realtime(
    model_name="default",
    input_format="pcm",
    json_config=config,
) as stt:
    ...
When calling the REST endpoints directly, pass json_config as a URL-encoded JSON string in the query parameters; see the REST guide for details.
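
As an illustration of the URL-encoding step, here is a sketch using only the Python standard library. The base URL and the build_stt_url helper are placeholders invented for this example; the actual endpoint path and any other required query parameters are described in the REST guide.

```python
import json
from urllib.parse import parse_qs, urlencode, urlsplit


def build_stt_url(base_url: str, config: dict) -> str:
    """Append json_config to base_url as a URL-encoded JSON string."""
    # Serialise the dict to compact JSON, then let urlencode percent-encode it.
    payload = json.dumps(config, separators=(",", ":"))
    return f"{base_url}?{urlencode({'json_config': payload})}"


# Placeholder URL for illustration only; not the real endpoint.
url = build_stt_url(
    "https://api.example.com/stt",
    {"language": "en", "temp": 0.3, "delay_in_frames": 16},
)

# Decoding the query string recovers the original dict.
decoded = json.loads(parse_qs(urlsplit(url).query)["json_config"][0])
```

Most HTTP clients perform the percent-encoding themselves when given a params mapping, in which case only the json.dumps step is needed.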

Semantic VAD and delay

delay_in_frames is most useful with the WebSocket STT stream. The server emits semantic VAD step messages every 80 ms; each step contains future horizons with inactivity_prob values. A common starting point for voice agents is to watch the longest horizon and flush when the inactivity probability stays above your threshold.
if msg["type"] == "step":
    turn_done = msg["vad"][-1]["inactivity_prob"] > 0.5
    if turn_done:
        await stt.send_flush(flush_id=1)
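
The snippet above reacts to a single step. To flush only when the probability stays above the threshold, one option is to require several consecutive steps. The sketch below assumes the message shape shown above; the TurnEndDetector class and its required_steps parameter are hypothetical names, not part of the SDK.

```python
class TurnEndDetector:
    """Signal end of turn after N consecutive steps above a threshold."""

    def __init__(self, threshold: float = 0.5, required_steps: int = 3):
        self.threshold = threshold
        # With one step every 80 ms, 3 steps is roughly 240 ms of sustained inactivity.
        self.required_steps = required_steps
        self._streak = 0

    def update(self, msg: dict) -> bool:
        """Feed one server message; return True when a flush looks warranted."""
        if msg.get("type") != "step":
            return False  # non-step messages leave the streak unchanged
        # Watch the longest horizon, i.e. the last entry of the vad list.
        prob = msg["vad"][-1]["inactivity_prob"]
        self._streak = self._streak + 1 if prob > self.threshold else 0
        if self._streak >= self.required_steps:
            self._streak = 0  # reset so the next turn starts fresh
            return True
        return False
```

In the receive loop, a True result from detector.update(msg) would replace the turn_done check before calling stt.send_flush. Suitable values for threshold and required_steps depend on the application and are best tuned on real conversations.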
Use lower delay_in_frames values for fast back-and-forth assistants, and higher values for transcription quality when a little more latency is acceptable.
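
To make the trade-off concrete: at 80 ms per frame, delay_in_frames=7 corresponds to 560 ms of context and delay_in_frames=48 to 3840 ms. The helper below is a sketch for picking the largest allowed value that fits a latency budget; pick_delay is a hypothetical name, and the figure covers only the configured context delay, not network or processing time.

```python
FRAME_MS = 80  # each frame is 80 ms of context
ALLOWED_DELAYS = (7, 8, 10, 12, 14, 16, 20, 24, 32, 36, 48)


def delay_ms(frames: int) -> int:
    """Context delay in milliseconds for a given delay_in_frames value."""
    return frames * FRAME_MS


def pick_delay(max_latency_ms: int) -> int:
    """Largest allowed delay_in_frames whose context delay fits the budget."""
    candidates = [f for f in ALLOWED_DELAYS if delay_ms(f) <= max_latency_ms]
    if not candidates:
        raise ValueError(
            f"no allowed delay fits {max_latency_ms} ms; "
            f"the minimum is {delay_ms(min(ALLOWED_DELAYS))} ms"
        )
    return max(candidates)


config = {"language": "en", "delay_in_frames": pick_delay(1500)}
```

With a budget of 1500 ms this selects 16 frames (1280 ms), since the next allowed value, 20 frames, would need 1600 ms.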