Transcription Settings

STT models accept additional configuration via the json_config parameter. In the Python SDK, this is a dict mapping option name to a JSON-serializable value. When using the REST endpoint, pass it as a URL-encoded JSON string in the json_config query parameter. These options apply to both the WebSocket and REST transports. For TTS settings, see Voice Settings.

Quick reference

Option	Type	Allowed values	Effect
`temp`	float	`0.0`–`1.5`	Sampling temperature for text generation. `0.0` is greedy; higher values produce more diverse output and can help when no text is being recognised.
`language`	string	`"en"`, `"fr"`, `"de"`, `"es"`, `"pt"`	Audio language for transcription models, or output language for translating models. See Languages.
`target_language`	string	`"en"`, `"fr"`, `"de"`, `"es"`, `"pt"`	Output language for translating models; ignored otherwise. See Languages.
`padding_bonus`	float	`-4.0`–`4.0`	Biases the model toward emitting text sooner (negative) or later (positive). Out-of-range values raise a server error.
`delay_in_frames`	int	`0`–`80`	Adaptive delay control. Each frame is 80 ms of context before text is emitted. Higher values improve quality at the cost of latency; lower values are more reactive. Values outside this range raise a server error. The legacy alias `delay_in_tokens` is also accepted.
`keywords`	object	`{ "words": string[], "boost": -6..6 }`	Bias transcription toward a custom dictionary of terms. `boost` is applied in log-space, so its effect is exponential; `3` is the recommended default. See Keyword boosting.

Default values may evolve with new model releases. Pin the options explicitly if you depend on a specific behaviour. The server validates known keys with the constraints above; unknown keys are silently ignored, so double-check spelling against the table.
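Because unknown keys are dropped without an error, a small client-side check can catch typos before a request is sent. The sketch below is illustrative only: `check_json_config` and `in_range` are hypothetical helpers, not part of the SDK, and the constraints are copied by hand from the table above, so they may lag behind the server.

```python
# Constraints copied from the quick-reference table (may change between releases).
LANGUAGES = {"en", "fr", "de", "es", "pt"}
NUMERIC_RANGES = {
    "temp": (0.0, 1.5),
    "padding_bonus": (-4.0, 4.0),
    "delay_in_frames": (0, 80),
    "delay_in_tokens": (0, 80),  # legacy alias of delay_in_frames
}
KNOWN_KEYS = set(NUMERIC_RANGES) | {"language", "target_language", "keywords"}


def in_range(value, low, high) -> bool:
    # bool is a subclass of int in Python, so reject it explicitly.
    is_number = isinstance(value, (int, float)) and not isinstance(value, bool)
    return is_number and low <= value <= high


def check_json_config(config: dict) -> list[str]:
    """Return a list of problems; an empty list means nothing was flagged."""
    problems = []
    for key, value in config.items():
        if key not in KNOWN_KEYS:
            problems.append(f"unknown key {key!r}: the server would silently ignore it")
        elif key in NUMERIC_RANGES:
            low, high = NUMERIC_RANGES[key]
            if not in_range(value, low, high):
                problems.append(f"{key}={value!r} is not a number in {low}..{high}")
        elif key in ("language", "target_language"):
            if not isinstance(value, str) or value not in LANGUAGES:
                problems.append(f"{key}={value!r} is not one of {sorted(LANGUAGES)}")
        else:  # key == "keywords"
            if not isinstance(value, dict) or not isinstance(value.get("words"), list):
                problems.append("keywords must be an object with a 'words' list")
            elif "boost" in value and not in_range(value["boost"], -6, 6):
                problems.append(f"keywords.boost={value['boost']!r} is not in -6..6")
    return problems


# A misspelled key and an out-of-range value are both reported.
for problem in check_json_config({"temprature": 0.3, "padding_bonus": 9}):
    print(problem)
```

Running the check in tests or at startup is usually enough; the server remains the source of truth for what is accepted.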

Languages

For non-translating transcription models, language is the expected language of the audio. It grounds the model to a single language for better transcription quality. For audio that mixes languages, set the dominant one, or leave the option unset when no single language predominates. target_language has no effect on these models.

For translating transcription models, language and target_language specify the language the transcription should be generated in, not the language of the audio. The two are interchangeable; set either one.
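The rules above translate into configs like the following. The variable names are only for illustration; whether a given model translates depends on the model you select, which is not shown here.

```python
# Non-translating model: "language" names the language spoken in the audio.
transcribe_french = {"language": "fr"}

# Non-translating model, mixed audio with no dominant language: leave it unset.
transcribe_mixed = {}

# Translating model: either key names the OUTPUT language; the two are equivalent.
translate_to_english = {"target_language": "en"}
translate_to_english_alt = {"language": "en"}
```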

Keyword boosting

Adapt transcription to your own vocabulary: player names, brands, products, and channel names. Pass a keywords dictionary and the decoder gives those terms priority, so the vocabulary that matters to your product is transcribed exactly as you expect, in real time and without retraining.

config = {
    "language": "en",
    "keywords": {
        "words": ["Mbappé", "Haaland", "Vinícius", "Bellingham"],
        "boost": 3,
    },
}

words: the terms to boost. Matching is token-by-token, so keywords must not contain spaces. Split multi-word names into separate entries: "Ferran Torres" becomes "Ferran" and "Torres". Matching is also case- and accent-sensitive, so include the variants you expect ("Mbappé" and "Mbappe"). You can boost up to 500 keywords, and every variant of a word (lowercase, uppercase, accented, etc.) counts toward that limit.
boost: how strongly to bias toward the dictionary, from -6 to 6. The value is applied in log-probability space, so its effect is exponential: small values already shift decoding noticeably.
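The rules for words can be applied mechanically when the dictionary comes from a list of display names. The sketch below uses hypothetical helpers (`strip_accents`, `build_keywords`) built on the standard library; it splits names on whitespace, adds an accent-free variant of each token, and enforces the 500-entry limit. It does not add case variants, so extend it if your audio needs them.

```python
import unicodedata

MAX_KEYWORDS = 500  # limit stated above; every variant counts toward it


def strip_accents(word: str) -> str:
    """Drop combining marks, e.g. 'Mbappé' -> 'Mbappe'."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def build_keywords(names: list[str], boost: float = 3) -> dict:
    """Build a keywords payload from display names (hypothetical helper)."""
    words: list[str] = []
    for name in names:
        for token in name.split():  # keywords must not contain spaces
            for variant in (token, strip_accents(token)):
                if variant not in words:  # keep order, skip duplicates
                    words.append(variant)
    if len(words) > MAX_KEYWORDS:
        raise ValueError(f"{len(words)} keywords exceeds the limit of {MAX_KEYWORDS}")
    return {"words": words, "boost": boost}


config = {"language": "en", "keywords": build_keywords(["Ferran Torres", "Mbappé"])}
print(config["keywords"]["words"])  # ['Ferran', 'Torres', 'Mbappé', 'Mbappe']
```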

Choosing a boost. 3 is the recommended default and works for most cases; raise it to 4 for unusually rare or foreign vocabulary. Higher values give diminishing returns and can make the decoder loop, repeating a boosted word instead of following the audio, so avoid going much beyond 4 to 5. Negative values de-boost a term. Use them only to remove a specific word you never want in the output, not as a general accuracy control.
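To get a feel for why small boosts already matter, the sketch below converts a boost into a multiplicative factor. It assumes the boost is added to a natural-log score before normalisation; the base and the exact point where the boost is applied are not stated in this guide, so treat the numbers as an order-of-magnitude illustration only. `boost_factor` is a hypothetical helper.

```python
import math


def boost_factor(boost: float) -> float:
    """Relative weight multiplier implied by an additive natural-log boost (assumed)."""
    return math.exp(boost)


for boost in (-3, 1, 3, 4, 6):
    print(f"boost={boost:+d} -> weight x{boost_factor(boost):.2f}")
```

Under that assumption, each extra unit of boost multiplies a term's relative weight by about 2.7, which is consistent with the advice to stay near 3 and avoid the top of the range.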

Boosting helps most for rare, domain-specific vocabulary and in low-latency (small delay_in_frames) streaming, where the model has less context to disambiguate names on its own. See the Keyword boosting recipe for an end-to-end example.

Passing `json_config`

The same json_config payload is sent regardless of which SDK API you use; only the call shape differs:

config = {"language": "en", "temp": 0.3, "delay_in_frames": 16}

# Pull-based: setup is a dict containing json_config.
stream = await client.stt_stream(
    {"model_name": "default", "input_format": "pcm", "json_config": config},
    audio_generator(audio_data),
)

# Real-time: setup is keyword arguments; json_config stays nested.
async with client.stt_realtime(
    model_name="default",
    input_format="pcm",
    json_config=config,
) as stt:
    ...

When calling the REST endpoints directly, pass json_config as a URL-encoded JSON string in the query parameters; see the REST guide.
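A minimal sketch of that encoding with the Python standard library follows. The host and path are placeholders (the real endpoint is in the REST guide); only the serialise-then-URL-encode step is the point here.

```python
import json
from urllib.parse import parse_qs, urlencode

config = {"language": "en", "temp": 0.3, "delay_in_frames": 16}

# Serialise to JSON first, then URL-encode the resulting string.
# Compact separators keep the query string short; any valid JSON works.
query = urlencode({"json_config": json.dumps(config, separators=(",", ":"))})

# Placeholder URL: substitute the endpoint from the REST guide.
url = f"https://api.example.com/stt?{query}"
print(url)

# Round trip: decoding the query string yields the original dict.
decoded = json.loads(parse_qs(query)["json_config"][0])
print(decoded == config)
```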

Semantic VAD and delay

delay_in_frames is most useful with the WebSocket STT stream. The server emits semantic VAD step messages every 80 ms; each step contains future horizons with inactivity_prob values. A common starting point for voice agents is to watch the longest horizon and flush when the inactivity probability stays above your threshold.

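if msg["type"] == "step":
    # The last entry in msg["vad"] is the longest horizon.
    turn_done = msg["vad"][-1]["inactivity_prob"] > 0.5  # your threshold
    if turn_done:
        # The turn looks finished: flush the stream.
        await stt.send_flush(flush_id=1)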

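To require that the probability stays above the threshold rather than crossing it once, you can count consecutive steps. The sketch below is one way to do that; `TurnDetector` and `required_steps` are hypothetical, the message shape follows the snippet above, and the default of 3 steps (about 240 ms at 80 ms per step) is an arbitrary starting value to tune.

```python
class TurnDetector:
    """Signals end of turn after N consecutive high-inactivity steps (sketch)."""

    def __init__(self, threshold: float = 0.5, required_steps: int = 3):
        self.threshold = threshold
        self.required_steps = required_steps
        self._streak = 0  # consecutive steps above the threshold

    def update(self, msg: dict) -> bool:
        """Feed one server message; return True when a flush is warranted."""
        if msg.get("type") != "step":
            return False
        prob = msg["vad"][-1]["inactivity_prob"]  # longest horizon
        self._streak = self._streak + 1 if prob > self.threshold else 0
        if self._streak >= self.required_steps:
            self._streak = 0  # start counting again for the next turn
            return True
        return False


detector = TurnDetector()
probs = [0.9, 0.9, 0.2, 0.9, 0.9, 0.9]
for p in probs:
    msg = {"type": "step", "vad": [{"inactivity_prob": 0.1}, {"inactivity_prob": p}]}
    print(p, detector.update(msg))
```

In a receive loop, a True result is where the flush call from the snippet above would go.
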
Use lower delay_in_frames values for fast back-and-forth assistants, and higher values for transcription quality when a little more latency is acceptable. A value such as 16 is a balanced starting point; try 8 when responsiveness matters more than context.
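
Since each frame is 80 ms, the latency cost of a setting is simple arithmetic. The sketch below uses a hypothetical helper, `delay_ms`, and counts only the context the model waits for; network and compute time are extra.

```python
FRAME_MS = 80    # duration of one frame, from the quick-reference table
MAX_FRAMES = 80  # upper bound accepted for delay_in_frames


def delay_ms(delay_in_frames: int) -> int:
    """Audio context, in milliseconds, that the model waits for before emitting text."""
    if not 0 <= delay_in_frames <= MAX_FRAMES:
        raise ValueError(f"delay_in_frames must be in 0..{MAX_FRAMES}")
    return delay_in_frames * FRAME_MS


for frames in (8, 16, 80):
    print(f"delay_in_frames={frames} -> {delay_ms(frames)} ms")
# delay_in_frames=8 -> 640 ms
# delay_in_frames=16 -> 1280 ms
# delay_in_frames=80 -> 6400 ms
```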
