The Speech-to-Text model converts audio input into text transcriptions, supporting real-time streaming and semantic voice activity detection (VAD).

Basic Streaming Usage

import asyncio
import gradium

async def main():
    client = gradium.client.GradiumClient(api_key="your-api-key")

    # Load raw PCM audio (16-bit signed, mono, 24kHz; see Setup Parameters)
    with open("audio.pcm", "rb") as f:
        audio_data = f.read()

    # Audio generator that yields audio chunks
    async def audio_generator(audio_data, chunk_size=1920):
        for i in range(0, len(audio_data), chunk_size):
            yield audio_data[i : i + chunk_size]

    # Create STT stream
    stream = await client.stt_stream(
        {"model_name": "default", "input_format": "pcm"},
        audio_generator(audio_data),
    )

    # Process transcription results
    async for message in stream.iter_text():
        print(message)

if __name__ == "__main__":
    asyncio.run(main())

Setup Parameters

  • model_name: The STT model to use (default: "default")
  • input_format: Audio format of the input data (supported: "pcm", "wav", "opus")
When using "pcm" input format, the audio must adhere to the following specifications:
  • Sample Rate: 24000 Hz (24kHz)
  • Format: PCM (Pulse Code Modulation)
  • Bit Depth: 16-bit signed integer
  • Channels: Single channel (mono)
  • Chunk Size: Recommended 1920 samples per chunk (80ms at 24kHz)
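As a sketch, the PCM constraints above can be exercised end-to-end: the helpers below (illustrative names, not part of the Gradium SDK) generate a test tone in the required format and split it into the recommended 80ms chunks.

```python
import math
import struct

SAMPLE_RATE = 24_000   # 24kHz, as required for "pcm" input
CHUNK_SAMPLES = 1920   # 80ms at 24kHz
BYTES_PER_SAMPLE = 2   # 16-bit signed integers

def make_test_tone(duration_s=0.5, freq_hz=440.0):
    """Generate mono 16-bit PCM bytes matching the spec above."""
    n = int(duration_s * SAMPLE_RATE)
    samples = (
        int(32767 * 0.3 * math.sin(2 * math.pi * freq_hz * t / SAMPLE_RATE))
        for t in range(n)
    )
    return struct.pack(f"<{n}h", *samples)

def chunk_pcm(pcm: bytes, chunk_samples=CHUNK_SAMPLES):
    """Split raw PCM bytes into chunks of `chunk_samples` samples
    (3840 bytes each at 16-bit mono; the last chunk may be shorter)."""
    step = chunk_samples * BYTES_PER_SAMPLE
    for i in range(0, len(pcm), step):
        yield pcm[i : i + step]
```

Each full chunk is 1920 samples × 2 bytes = 3840 bytes, i.e. exactly 80ms of audio, which lines up with the 80ms cadence of the VAD steps described below.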
When using "wav" input format, the audio must be a valid WAV file containing PCM data (i.e. AudioFormat = 1 in the WAV header). Supported bit depths are 16, 24, and 32 bits per sample. When using "opus" input format, the audio must be an Ogg-encapsulated Opus stream.
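A quick way to pre-validate a WAV payload on the client side is Python's standard-library wave module, which itself only accepts PCM data (AudioFormat = 1). This helper is an illustrative sketch, not part of the SDK, and the server's own validation may differ.

```python
import io
import wave

def check_wav(data: bytes) -> int:
    """Verify `data` is a PCM WAV file with a supported bit depth.

    Returns the bit depth; raises on non-PCM files (wave rejects them)
    or on unsupported bit depths.
    """
    with wave.open(io.BytesIO(data)) as w:
        bits = w.getsampwidth() * 8
        if bits not in (16, 24, 32):
            raise ValueError(f"unsupported bit depth: {bits}")
        return bits
```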

Message Types

The STT stream returns different types of messages:
  • Text Messages (text): Contain transcription results together with timestamps.
  • VAD Messages (step): Provide Voice Activity Detection information to determine when the speaker has finished speaking.
  • Flushed Messages (flushed): Confirm that buffered audio has been processed (see Flushing below).
# Text messages containing transcription results
async for msg in stream._stream:
    if msg.get("type") == "text":
        print(f"Transcription: {msg}")

    # VAD (Voice Activity Detection) messages, emitted every 80ms
    elif msg.get("type") == "step":
        # The third entry of the "vad" list carries the turn-taking signal;
        # use its "inactivity_prob" to detect when the speaker has finished
        inactivity_probability = msg["vad"][2].get("inactivity_prob")
        print(f"Inactivity probability: {inactivity_probability}")

    # Flushed messages (response to send_flush)
    elif msg.get("type") == "flushed":
        print(f"Flush complete: {msg['flush_id']}")
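One way to act on the step messages is to flag end of turn once inactivity_prob stays high across several consecutive 80ms VAD steps, which smooths out momentary pauses. The threshold and streak length below are illustrative choices, not values documented by the API.

```python
class TurnDetector:
    """Flags end of turn once inactivity_prob stays at or above a
    threshold for several consecutive 80ms VAD steps.
    Default values are illustrative (~480ms of sustained inactivity)."""

    def __init__(self, threshold=0.7, required_steps=6):
        self.threshold = threshold
        self.required_steps = required_steps
        self._streak = 0

    def update(self, msg: dict) -> bool:
        """Feed one stream message; return True when a turn boundary is detected."""
        if msg.get("type") != "step":
            return False
        prob = msg["vad"][2].get("inactivity_prob", 0.0)
        self._streak = self._streak + 1 if prob >= self.threshold else 0
        return self._streak >= self.required_steps
```

In a live loop you would call update() on every incoming message and, when it returns True, stop capturing audio or send a flush (see Flushing below).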

Flushing

You can force the server to immediately process all buffered audio using send_flush(). This is useful when you need transcription results without waiting for the normal processing cadence, for example at a natural pause in conversation or before switching speakers. The server will process all outstanding audio, return any pending text results, and then respond with a flushed message containing the same flush_id you provided.
async with client.stt_realtime(model_name="default", input_format="pcm") as stt:
    # Send some audio
    for i in range(0, len(audio_chunk), 1920):
        await stt.send_audio(audio_chunk[i:i + 1920])

    # Force processing of all buffered audio
    await stt.send_flush(flush_id=1)

    # Continue receiving - text results arrive, then a flushed confirmation
    async for msg in stt:
        if msg["type"] == "text":
            print(f"Transcription: {msg['text']}")
        elif msg["type"] == "flushed":
            print(f"All audio up to flush {msg['flush_id']} has been processed")
            break

Advanced Options

Some models support advanced options that can be passed using the json_config parameter. In the Python API, this parameter is a dictionary mapping strings to values (either float or string). It can be used to control:
  • Stability of the generated text via the text temperature parameter.
  • Expected language via the language parameter.
  • Delay before text is generated, in audio frames, via the delay_in_frames parameter.
Temperature Control

Sets the temperature used for text generation. The default value is 0, resulting in greedy sampling. Higher values (up to 1) produce more diverse outputs, which can be helpful if no text is recognized.

Language Control

Sets the expected language of the audio. This can help ground the model in a specific language and improve transcription quality. If multiple languages are expected, set this to the main language.

Delay Control

Sets the delay, in audio frames (80ms each), before text is generated. Higher delays allow the model to gather more context before generating text, which can improve quality at the cost of latency. The allowed values are 7, 8, 10, 12, 14, 16, 20, 24, 36, and 48.
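Putting the three options together, a small helper can assemble and sanity-check a json_config dictionary. The language and delay_in_frames key names come from the parameters above; the exact key for the text temperature ("text_temperature" below) is an assumption, so check your model's documentation.

```python
# Allowed delay values, in 80ms audio frames (from the docs above)
ALLOWED_DELAYS = (7, 8, 10, 12, 14, 16, 20, 24, 36, 48)

def build_json_config(text_temperature=0.0, language=None, delay_in_frames=None):
    """Assemble a json_config dict mapping strings to float or string values.

    The "text_temperature" key name is an assumption; "language" and
    "delay_in_frames" follow the parameter names in the docs above.
    """
    config = {"text_temperature": float(text_temperature)}
    if language is not None:
        config["language"] = language
    if delay_in_frames is not None:
        if delay_in_frames not in ALLOWED_DELAYS:
            raise ValueError(f"delay_in_frames must be one of {ALLOWED_DELAYS}")
        config["delay_in_frames"] = float(delay_in_frames)
    return config
```

The resulting dictionary would then be passed as the json_config argument when creating the STT stream.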