Connect to this endpoint via WebSocket for real-time speech-to-text conversion with streaming audio input.
Connection URL:
- Europe: wss://eu.api.gradium.ai/api/speech/asr
- USA: wss://us.api.gradium.ai/api/speech/asr
Authentication: Include your API key in the WebSocket connection header:
x-api-key: your_api_key

| Direction | Message Type | Example |
|---|---|---|
| 🔵⬆️ Client→Server | Setup (first) | {"type": "setup", "model_name": "default", "input_format": "pcm"} |
| 🟢⬇️ Server→Client | Ready | {"type": "ready", "request_id": "uuid", "model_name": "default", "sample_rate": 24000} |
| 🔵⬆️ Client→Server | Audio | {"type": "audio", "audio": "base64..."} |
| 🟢⬇️ Server→Client | Text (result) | {"type": "text", "text": "Hello world", "start_s": 0.5} |
| 🟢⬇️ Server→Client | VAD (activity) | {"type": "step", "vad": [...], "step_idx": 5, "step_duration_s": 0.08} |
| 🟢⬇️ Server→Client | End Text | {"type": "end_text", "stop_s": 2.5} |
| 🔵⬆️ Client→Server | Flush | {"type": "flush", "flush_id": "..."} |
| 🟢⬇️ Server→Client | Flushed | {"type": "flushed", "flush_id": "..."} |
| 🔵⬆️ Client→Server | EndOfStream | {"type": "end_of_stream"} |
| 🟢⬇️ Server→Client | EndOfStream | {"type": "end_of_stream"} |
| 🔴⬇️ Server→Client | Error | {"type": "error", "message": "Error description", "code": 1008} |
Setup message. Direction: Client → Server. Format: JSON object
{
"type": "setup",
"model_name": "default",
"input_format": "pcm"
}
Fields:
- type (string, required): Must be "setup"
- model_name (string, required): The speech-to-text model to use (default: "default")
- input_format (string, required): Audio format: "pcm", "wav", or "opus"

Important: This must be the very first message sent after the connection is established. The server will close the connection if any other message is sent first.
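A small helper for building the setup message might look like this; `build_setup` is an illustrative name, and the field values come from the description above.

```python
import json


def build_setup(model_name="default", input_format="pcm"):
    """Build the setup message, which must be the first message on the socket."""
    if input_format not in ("pcm", "wav", "opus"):
        raise ValueError(f"unsupported input_format: {input_format}")
    return json.dumps({"type": "setup",
                       "model_name": model_name,
                       "input_format": input_format})
```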
Ready message. Direction: Server → Client. Format: JSON object
{
"type": "ready",
"request_id": "550e8400-e29b-41d4-a716-446655440000",
"model_name": "default",
"sample_rate": 24000,
"frame_size": 1920,
"delay_in_frames": 0,
"text_stream_names": []
}
Fields:
- type (string): Will be "ready"
- request_id (string): Unique identifier for the session
- model_name (string): The speech-to-text model being used
- sample_rate (integer): Expected sample rate in Hz (typically 24000)
- frame_size (integer): Number of samples per processing frame (typically 1920, equivalent to 80 ms at 24 kHz)
- delay_in_frames (integer): Model delay, in audio frames
- text_stream_names (array): List of text stream names

This message is sent by the server after receiving the setup message, indicating that the connection is ready to receive audio.
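The ready message carries everything the client needs to size its audio chunks. A hypothetical `parse_ready` helper, derived only from the fields above, could extract the useful values and compute the frame duration:

```python
import json


def parse_ready(raw):
    """Extract streaming parameters from the server's ready message."""
    msg = json.loads(raw)
    if msg["type"] != "ready":
        raise ValueError(f"expected ready, got {msg['type']}")
    # Duration of one model frame: frame_size samples at sample_rate Hz.
    frame_s = msg["frame_size"] / msg["sample_rate"]
    return {
        "request_id": msg["request_id"],
        "sample_rate": msg["sample_rate"],
        "frame_size": msg["frame_size"],
        "frame_s": frame_s,
        "delay_in_frames": msg["delay_in_frames"],
    }
```

With the typical values (frame_size 1920, sample_rate 24000) this yields an 80 ms frame.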
Audio message. Direction: Client → Server. Format: JSON object (with base64-encoded audio data)
{
"type": "audio",
"audio": "base64_encoded_audio_data..."
}
Fields:
- type (string, required): Must be "audio"
- audio (string, required): Base64-encoded audio data

Audio Format Requirements:
When using the "wav" input format, the audio must be a valid WAV file containing PCM data (AudioFormat = 1 in the WAV header). Supported bit depths are 16, 24, and 32 bits per sample.

When using the "opus" input format, the audio must be an Ogg-encapsulated Opus stream.
Send audio messages to be transcribed. You can send multiple audio messages in sequence. The server will stream text and VAD responses as it processes the audio.
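For PCM input, a convenient (though not required) choice is to send one model frame per message. The sketch below assumes 16-bit samples; `chunk_pcm` and `build_audio_message` are illustrative helper names.

```python
import base64
import json


def chunk_pcm(pcm_bytes, frame_size, bytes_per_sample=2):
    """Split raw PCM bytes into frame_size-sample chunks (16-bit samples assumed)."""
    step = frame_size * bytes_per_sample
    return [pcm_bytes[i:i + step] for i in range(0, len(pcm_bytes), step)]


def build_audio_message(pcm_bytes):
    """Wrap raw audio bytes into an audio message with a base64 payload."""
    return json.dumps({"type": "audio",
                       "audio": base64.b64encode(pcm_bytes).decode("ascii")})
```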
Text message. Direction: Server → Client. Format: JSON object
{
"type": "text",
"text": "Hello world",
"start_s": 0.5,
"stream_id": null
}
Fields:
- type (string): Will be "text"
- text (string): The transcribed text
- start_s (float): Start time of the transcription in seconds
- stream_id (integer or null): Stream identifier for tracking multiple concurrent streams

Text messages contain the transcribed speech. Multiple text messages are streamed as the audio is processed.
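Because the transcript arrives as a stream of small text messages, a client typically accumulates them. A minimal accumulator sketch (the `Transcript` class is illustrative, not part of the API):

```python
import json


class Transcript:
    """Accumulate streamed text messages into a running transcript."""

    def __init__(self):
        self.segments = []  # (start_s, text) pairs in arrival order

    def feed(self, raw):
        """Consume one raw server message; ignore everything but text messages."""
        msg = json.loads(raw)
        if msg["type"] == "text":
            self.segments.append((msg["start_s"], msg["text"]))

    def text(self):
        """Join all received segments into a single string."""
        return " ".join(t for _, t in self.segments)
```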
Step (VAD) message. Direction: Server → Client. Format: JSON object
{
"type": "step",
"vad": [
{
"horizon_s": 0.5,
"inactivity_prob": 0.05
},
{
"horizon_s": 1.0,
"inactivity_prob": 0.08
},
{
"horizon_s": 2.0,
"inactivity_prob": 0.12
}
],
"step_idx": 5,
"step_duration_s": 0.08,
"total_duration_s": 0.4
}
Fields:
- type (string): Will be "step"
- vad (array): List of VAD predictions at increasing future horizons
  - horizon_s (float): Lookahead duration in seconds
  - inactivity_prob (float): Probability that voice activity has ended by this horizon
- step_idx (integer): The step index (increments every 80 ms)
- step_duration_s (float): Duration of this step in seconds (typically 0.08)
- total_duration_s (float): Total duration of audio processed so far

VAD Interpretation:
- Use the inactivity_prob value from the longest horizon to determine whether the speaker has likely finished
- Higher inactivity_prob values indicate higher confidence that speaking has ended
- A common choice is to use vad[2]["inactivity_prob"] (the third prediction) as the turn-taking indicator

End Text message. Direction: Server → Client. Format: JSON object
{
"type": "end_text",
"stop_s": 2.5,
"stream_id": null
}
Fields:
- type (string): Will be "end_text"
- stop_s (float): Stop time of the last text message in seconds
- stream_id (integer or null): Stream identifier

Sent when the previous text segment has finished and its end timestamp is available.
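The turn-taking guidance for step messages (read the longest-horizon inactivity probability) fits in a one-line helper. The `turn_ended` name and the 0.5 threshold follow the note later on this page and are otherwise illustrative:

```python
def turn_ended(step_msg, threshold=0.5):
    """True when the longest-horizon VAD prediction says speech has likely ended.

    step_msg is a decoded "step" message; vad[2] is the third (longest-horizon)
    prediction, suggested above as the turn-taking indicator.
    """
    return step_msg["vad"][2]["inactivity_prob"] > threshold
```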
Flush message. Direction: Client → Server. Format: JSON object
{
"type": "flush",
"flush_id": "1"
}
Fields:
- type (string, required): Must be "flush"
- flush_id (string, required): Unique identifier for the flush request

This message can be sent by the client to request that the server flush any
buffered audio and return all outstanding text results immediately. The server
will respond with a flushed message containing the same flush_id once the
flush is complete.
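Since each flushed reply echoes its flush_id, a client can track which flush requests are still outstanding. A small bookkeeping sketch (the `FlushTracker` class is illustrative; the document does not prescribe how ids are generated, only that they are unique):

```python
import itertools
import json


class FlushTracker:
    """Issue flush requests and match the server's flushed replies."""

    def __init__(self):
        self._ids = itertools.count(1)  # simple unique-id source
        self.pending = set()

    def request(self):
        """Return a flush message to send, remembering its id."""
        flush_id = str(next(self._ids))
        self.pending.add(flush_id)
        return json.dumps({"type": "flush", "flush_id": flush_id})

    def on_flushed(self, msg):
        """Handle a decoded flushed message; True when nothing is outstanding."""
        self.pending.discard(msg["flush_id"])
        return not self.pending
```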
EndOfStream message. Direction: Client → Server and Server → Client. Format: JSON object
{
"type": "end_of_stream"
}
This message is sent by the client when it has finished sending audio. The server will then process any remaining audio and send back all outstanding text results, VAD information, and then an end_of_stream message before closing the connection.
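The tail of a session (collect remaining text until the server's end_of_stream) can be expressed as a pure function over the incoming messages. The `drain` name is illustrative; the message shapes are the ones defined above:

```python
import json


def drain(messages):
    """Process server messages until end_of_stream.

    Returns (transcript, last_stop_s), where last_stop_s comes from the
    final end_text message seen, if any.
    """
    texts, stop_s = [], None
    for raw in messages:
        msg = json.loads(raw)
        if msg["type"] == "text":
            texts.append(msg["text"])
        elif msg["type"] == "end_text":
            stop_s = msg["stop_s"]
        elif msg["type"] == "end_of_stream":
            break
    return " ".join(texts), stop_s
```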
When errors occur, the server sends an error message as JSON before closing the connection:
Error Message Format:
{
"type": "error",
"message": "Error description explaining what went wrong",
"code": 1008
}
Common Error Codes:
- 1008: Policy Violation (e.g., invalid API key, missing setup message, invalid audio format)
- 1011: Internal Server Error (unexpected server-side error)

Usage notes:
- Always send end_of_stream when done, to properly close the session.
- A simple turn-taking heuristic is turn_ended = msg["vad"][2]["inactivity_prob"] > 0.5.
- The transcript lags the audio by the delay_in_frames audio frames processed by the model. Instead of feeding silence from the speaker, the system can be made more reactive by flushing the remainder of the turn's transcript. For that, you can feed in delay_in_frames chunks of silence (vectors of zeros). If those are fed in faster than realtime, the API can also process them faster, allowing a considerably more reactive turn-around.

Prerequisites:
- Your Gradium API key
- An established WebSocket connection
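The silence-feeding trick for a faster turn-around reduces to generating delay_in_frames audio messages of zero-valued samples. A sketch assuming 16-bit PCM (the `silence_messages` helper name is illustrative):

```python
import base64
import json


def silence_messages(delay_in_frames, frame_size, bytes_per_sample=2):
    """Build delay_in_frames audio messages of zero-valued samples.

    Assumes 16-bit PCM by default, so one frame is frame_size * 2 bytes of
    zeros. Sending these faster than realtime flushes the remainder of the
    turn's transcript out of the model's delay window.
    """
    silent_frame = b"\x00" * (frame_size * bytes_per_sample)
    payload = base64.b64encode(silent_frame).decode("ascii")
    return [json.dumps({"type": "audio", "audio": payload})
            for _ in range(delay_in_frames)]
```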