The Speech-to-Text model converts audio input into text transcriptions, supporting real-time streaming and semantic voice activity detection (VAD).

Basic Streaming Usage

import asyncio
import gradium

async def main():
    client = gradium.client.GradiumClient(api_key="your-api-key")

    # Load raw PCM audio (16-bit signed, mono, 24kHz; see Setup Parameters)
    with open("audio.pcm", "rb") as f:
        audio_data = f.read()

    # Audio generator that yields audio chunks
    async def audio_generator(audio_data, chunk_size=1920):
        for i in range(0, len(audio_data), chunk_size):
            yield audio_data[i : i + chunk_size]

    # Create STT stream
    stream = await client.stt_stream(
        {"model_name": "default", "input_format": "pcm"},
        audio_generator(audio_data),
    )

    # Process transcription results
    async for message in stream.iter_text():
        print(message)

if __name__ == "__main__":
    asyncio.run(main())

Setup Parameters

  • model_name: The STT model to use (default: "default")
  • input_format: Audio format of the input data (supported: "pcm", "wav", "opus")
When using "pcm" input format, the audio must adhere to the following specifications:
  • Sample Rate: 24000 Hz (24kHz)
  • Format: PCM (Pulse Code Modulation)
  • Bit Depth: 16-bit signed integer
  • Channels: Single channel (mono)
  • Chunk Size: Recommended 1920 samples per chunk (80ms at 24kHz)
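As a sketch, the PCM constraints above can be exercised end-to-end: the helpers below (illustrative names, not part of the Gradium SDK) generate a test tone in the required format and split it into the recommended 80ms chunks.

```python
import math
import struct

SAMPLE_RATE = 24_000   # 24kHz, as required for "pcm" input
CHUNK_SAMPLES = 1920   # 80ms at 24kHz
BYTES_PER_SAMPLE = 2   # 16-bit signed integers

def make_test_tone(duration_s=0.5, freq_hz=440.0):
    """Generate mono 16-bit PCM bytes matching the spec above."""
    n = int(duration_s * SAMPLE_RATE)
    samples = (
        int(32767 * 0.3 * math.sin(2 * math.pi * freq_hz * t / SAMPLE_RATE))
        for t in range(n)
    )
    return struct.pack(f"<{n}h", *samples)

def chunk_pcm(pcm: bytes, chunk_samples=CHUNK_SAMPLES):
    """Split raw PCM bytes into chunks of `chunk_samples` samples
    (3840 bytes each at 16-bit mono; the last chunk may be shorter)."""
    step = chunk_samples * BYTES_PER_SAMPLE
    for i in range(0, len(pcm), step):
        yield pcm[i : i + step]
```

Each full chunk is 1920 samples × 2 bytes = 3840 bytes, i.e. exactly 80ms of audio, which lines up with the 80ms cadence of the VAD steps described below.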
When using "wav" input format, the audio must be a valid WAV file containing PCM data (i.e. AudioFormat = 1 in the WAV header). Supported bit depths are 16, 24, and 32 bits per sample. When using "opus" input format, the audio must be an Ogg-encapsulated Opus stream.
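A quick way to pre-validate a WAV payload on the client side is Python's standard-library wave module, which itself only accepts PCM data (AudioFormat = 1). This helper is an illustrative sketch, not part of the SDK, and the server's own validation may differ.

```python
import io
import wave

def check_wav(data: bytes) -> int:
    """Verify `data` is a PCM WAV file with a supported bit depth.

    Returns the bit depth; raises on non-PCM files (wave rejects them)
    or on unsupported bit depths.
    """
    with wave.open(io.BytesIO(data)) as w:
        bits = w.getsampwidth() * 8
        if bits not in (16, 24, 32):
            raise ValueError(f"unsupported bit depth: {bits}")
        return bits
```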

Message Types

The STT stream returns different types of messages:
  • Text Messages (text): Contain transcription results together with timestamps.
  • VAD Messages (step): Provide Voice Activity Detection information to determine when the speaker has finished speaking.
  • Flushed Messages (flushed): Confirm that buffered audio has been processed (see Flushing below).
# Text messages containing transcription results
async for msg in stream._stream:
    if msg.get("type") == "text":
        print(f"Transcription: {msg}")

    # VAD (Voice Activity Detection) messages, emitted every 80ms
    elif msg.get("type") == "step":
        # The third entry of the "vad" list carries the turn-taking signal;
        # use its "inactivity_prob" to detect when the speaker has finished
        inactivity_probability = msg["vad"][2].get("inactivity_prob")
        print(f"Inactivity probability: {inactivity_probability}")

    # Flushed messages (response to send_flush)
    elif msg.get("type") == "flushed":
        print(f"Flush complete: {msg['flush_id']}")
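One way to act on the step messages is to flag end of turn once inactivity_prob stays high across several consecutive 80ms VAD steps, which smooths out momentary pauses. The threshold and streak length below are illustrative choices, not values documented by the API.

```python
class TurnDetector:
    """Flags end of turn once inactivity_prob stays at or above a
    threshold for several consecutive 80ms VAD steps.
    Default values are illustrative (~480ms of sustained inactivity)."""

    def __init__(self, threshold=0.7, required_steps=6):
        self.threshold = threshold
        self.required_steps = required_steps
        self._streak = 0

    def update(self, msg: dict) -> bool:
        """Feed one stream message; return True when a turn boundary is detected."""
        if msg.get("type") != "step":
            return False
        prob = msg["vad"][2].get("inactivity_prob", 0.0)
        self._streak = self._streak + 1 if prob >= self.threshold else 0
        return self._streak >= self.required_steps
```

In a live loop you would call update() on every incoming message and, when it returns True, stop capturing audio or send a flush (see Flushing below).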

Flushing

You can force the server to immediately process all buffered audio using send_flush(). This is useful when you need transcription results without waiting for the normal processing cadence, for example at a natural pause in conversation or before switching speakers. The server will process all outstanding audio, return any pending text results, and then respond with a flushed message containing the same flush_id you provided.
async with client.stt_realtime(model_name="default", input_format="pcm") as stt:
    # Send some audio
    for i in range(0, len(audio_chunk), 1920):
        await stt.send_audio(audio_chunk[i:i + 1920])

    # Force processing of all buffered audio
    await stt.send_flush(flush_id=1)

    # Continue receiving - text results arrive, then a flushed confirmation
    async for msg in stt:
        if msg["type"] == "text":
            print(f"Transcription: {msg['text']}")
        elif msg["type"] == "flushed":
            print(f"All audio up to flush {msg['flush_id']} has been processed")
            break

Advanced Options

Some models support advanced options that can be passed using the json_config parameter. In the Python API, this parameter is a dictionary mapping strings to values (either float or string). It can be used to control:
  • Stability of the generated text via the text temperature parameter.
  • Expected language via the language parameter.
  • Delay before text is generated, in audio frames, via the delay_in_frames parameter.
Temperature Control

Sets the temperature used for text generation. The default value is 0, resulting in greedy sampling. Higher values (up to 1) produce more diverse outputs, which can be helpful if no text is recognized.

Language Control

Sets the expected language of the audio. This can help ground the model in a specific language and improve transcription quality. If multiple languages are expected, set this to the main language.

Delay Control

Sets the delay, in audio frames (80ms each), before text is generated. Higher delays allow the model to gather more context before generating text, which can improve quality at the cost of latency. The allowed values are 7, 8, 10, 12, 14, 16, 20, 24, 36, and 48.
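Putting the three options together, a small helper can assemble and sanity-check a json_config dictionary. The language and delay_in_frames key names come from the parameters above; the exact key for the text temperature ("text_temperature" below) is an assumption, so check your model's documentation.

```python
# Allowed delay values, in 80ms audio frames (from the docs above)
ALLOWED_DELAYS = (7, 8, 10, 12, 14, 16, 20, 24, 36, 48)

def build_json_config(text_temperature=0.0, language=None, delay_in_frames=None):
    """Assemble a json_config dict mapping strings to float or string values.

    The "text_temperature" key name is an assumption; "language" and
    "delay_in_frames" follow the parameter names in the docs above.
    """
    config = {"text_temperature": float(text_temperature)}
    if language is not None:
        config["language"] = language
    if delay_in_frames is not None:
        if delay_in_frames not in ALLOWED_DELAYS:
            raise ValueError(f"delay_in_frames must be one of {ALLOWED_DELAYS}")
        config["delay_in_frames"] = float(delay_in_frames)
    return config
```

The resulting dictionary would then be passed as the json_config argument when creating the STT stream.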