Telephony providers commonly send 8 kHz mono audio encoded as G.711 mu-law or A-law. Gradium accepts both encodings directly as STT input and can return TTS audio in the same encodings, so audio can usually pass through your bridge without transcoding.
| Use case | Gradium format | Notes |
|---|---|---|
| Incoming PSTN audio to STT | ulaw_8000 or alaw_8000 | Use the encoding your telephony provider sends. |
| Incoming linear PCM to STT | pcm_8000 or pcm_16000 | 16-bit signed little-endian mono PCM. |
| Outgoing audio to PSTN | ulaw_8000 or alaw_8000 | Avoid resampling in your telephony bridge when possible. |
| Higher-quality app audio | pcm_24000, pcm_48000, wav, or opus | Use when the audio does not pass through the PSTN. |
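If your bridge handles more than one provider or codec, it can help to derive input_format from what the provider reports instead of hard-coding it. A minimal Python sketch, assuming the provider labels each stream with an encoding name and a sample rate; the helper name and the encoding labels (mulaw, pcmu, pcm16le, and so on) are hypothetical, and only the Gradium format strings come from the table above:

```python
# Hypothetical mapping from provider-reported (encoding, sample rate) to the
# Gradium format strings in the table above. The encoding labels are examples.
GRADIUM_FORMATS = {
    ("mulaw", 8000): "ulaw_8000",
    ("ulaw", 8000): "ulaw_8000",
    ("pcmu", 8000): "ulaw_8000",      # G.711 mu-law
    ("alaw", 8000): "alaw_8000",
    ("pcma", 8000): "alaw_8000",      # G.711 A-law
    ("pcm16le", 8000): "pcm_8000",    # 16-bit signed little-endian mono
    ("pcm16le", 16000): "pcm_16000",
}


def gradium_input_format(encoding: str, sample_rate: int) -> str:
    """Return the Gradium input_format matching what the provider sends."""
    try:
        return GRADIUM_FORMATS[(encoding.lower(), sample_rate)]
    except KeyError:
        # No direct match: transcode or resample in the bridge first.
        raise ValueError(
            f"no direct Gradium format for {encoding} at {sample_rate} Hz"
        ) from None


print(gradium_input_format("PCMU", 8000))  # ulaw_8000
```

Failing loudly on an unknown combination keeps the bridge from labeling audio with the wrong format, which the checklist warns against.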

STT WebSocket Setup

```json
{
  "type": "setup",
  "model_name": "default",
  "input_format": "ulaw_8000",
  "json_config": {"language": "en"}
}
```

Then send base64-encoded audio payloads:

```json
{"type": "audio", "audio": "base64_encoded_ulaw_chunk"}
```
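Both STT messages can be built with the Python standard library alone. This sketch only constructs the JSON strings; the WebSocket client, endpoint URL, and authentication are not covered here, and the function names are hypothetical:

```python
import base64
import json


def stt_setup_message(input_format: str = "ulaw_8000", language: str = "en") -> str:
    # Mirrors the setup JSON above, with model_name "default".
    return json.dumps({
        "type": "setup",
        "model_name": "default",
        "input_format": input_format,
        "json_config": {"language": language},
    })


def stt_audio_message(chunk: bytes) -> str:
    # Raw audio bytes (here mu-law) go into the "audio" field as base64 text.
    return json.dumps({
        "type": "audio",
        "audio": base64.b64encode(chunk).decode("ascii"),
    })
```

Send the setup string once, then one audio string per chunk, over the same WebSocket connection.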

TTS WebSocket Setup

```json
{
  "type": "setup",
  "voice_id": "YTpq7expH9539ERJ",
  "output_format": "ulaw_8000"
}
```
The audio messages Gradium sends back contain base64-encoded mu-law chunks that can be forwarded to a telephony media stream.
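Providers that expect fixed 20 ms frames (160 bytes of 8 kHz mu-law) may need the TTS audio re-chunked, since generated chunks need not align to frame boundaries. A minimal sketch, assuming server audio messages have the same {"type": "audio", "audio": "<base64>"} shape as the client audio message shown earlier; confirm the actual field names in the Text-to-Speech WebSocket reference. MulawFramer is a hypothetical name:

```python
import base64
import json
from typing import List


class MulawFramer:
    """Re-chunk decoded TTS audio into fixed-size frames, carrying any remainder."""

    def __init__(self, frame_bytes: int = 160):  # 20 ms of 8 kHz mu-law
        self.frame_bytes = frame_bytes
        self._buf = bytearray()

    def feed(self, raw_message: str) -> List[bytes]:
        msg = json.loads(raw_message)
        # Assumed message shape; ignore anything that is not audio.
        if msg.get("type") != "audio":
            return []
        self._buf.extend(base64.b64decode(msg["audio"]))
        frames = []
        while len(self._buf) >= self.frame_bytes:
            frames.append(bytes(self._buf[:self.frame_bytes]))
            del self._buf[:self.frame_bytes]
        return frames

    def flush(self) -> bytes:
        # Leftover partial frame at the end of an utterance.
        rest = bytes(self._buf)
        self._buf.clear()
        return rest
```

Call flush() at the end of an utterance; whether to pad the last partial frame with mu-law silence (0xFF) or send it short depends on your provider.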

Chunk Size Guidance

Many telephony media streams send 20 ms frames. Gradium accepts small chunks, but batching to roughly 80 ms can reduce message overhead while keeping latency low:
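| Format | 20 ms payload | 80 ms payload |
|---|---|---|
| ulaw_8000 / alaw_8000 | 160 bytes | 640 bytes |
| pcm_8000 | 320 bytes | 1280 bytes |
| pcm_16000 | 640 bytes | 2560 bytes |

Sizes are raw audio bytes before base64 encoding, which adds roughly one third.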
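The byte counts in the table follow from sample rate × duration × bytes per sample (1 for mu-law and A-law, 2 for 16-bit PCM). This sketch reproduces them and batches incoming 20 ms frames into roughly 80 ms chunks before sending; payload_bytes and ChunkBatcher are hypothetical names:

```python
from typing import Optional

# Bytes per sample and sample rate for the formats in the table above.
BYTES_PER_SAMPLE = {"ulaw_8000": 1, "alaw_8000": 1, "pcm_8000": 2, "pcm_16000": 2}
SAMPLE_RATE = {"ulaw_8000": 8000, "alaw_8000": 8000, "pcm_8000": 8000, "pcm_16000": 16000}


def payload_bytes(fmt: str, duration_ms: int) -> int:
    """Raw payload size (before base64) for duration_ms of mono audio."""
    samples = SAMPLE_RATE[fmt] * duration_ms // 1000
    return samples * BYTES_PER_SAMPLE[fmt]


class ChunkBatcher:
    """Accumulate small provider frames and release ~target_ms chunks."""

    def __init__(self, fmt: str, target_ms: int = 80):
        self.target = payload_bytes(fmt, target_ms)
        self._buf = bytearray()

    def add(self, frame: bytes) -> Optional[bytes]:
        self._buf.extend(frame)
        if len(self._buf) < self.target:
            return None  # keep buffering
        chunk = bytes(self._buf)
        self._buf.clear()
        return chunk

    def flush(self) -> bytes:
        # Send whatever is left, e.g. when the call ends.
        chunk = bytes(self._buf)
        self._buf.clear()
        return chunk


print(payload_bytes("pcm_16000", 20), payload_bytes("pcm_16000", 80))  # 640 2560
```

Each chunk returned by add() or flush() would then be base64-encoded into an audio message. A larger target_ms trades a little latency for fewer messages.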

Bridge Checklist

  • Preserve the provider’s stream/session ID in your own logs.
  • Use one Gradium STT session per caller audio stream.
  • Set input_format to the actual bytes you forward.
  • For TTS, request an output format your provider can play directly.
  • Use STT step messages or your provider’s VAD to decide when to send the user’s turn to an agent.
  • Close the WebSocket and start a new session if the call leg changes format.
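Several checklist items (one session per caller stream, logging the provider's stream ID, starting a new session when the format changes) can live in one small registry. A minimal sketch; open_session and close_session are hypothetical callbacks standing in for whatever Gradium WebSocket client code your bridge uses:

```python
import logging
from dataclasses import dataclass
from typing import Callable, Dict

log = logging.getLogger("telephony_bridge")


@dataclass
class StreamState:
    input_format: str
    session: object  # whatever your (hypothetical) open_session returns


class SessionRegistry:
    """One Gradium STT session per caller stream, reopened on format change."""

    def __init__(self, open_session: Callable[[str, str], object],
                 close_session: Callable[[object], None]):
        self._open = open_session      # hypothetical: opens WebSocket, sends setup
        self._close = close_session    # hypothetical: closes that WebSocket
        self._streams: Dict[str, StreamState] = {}

    def session_for(self, stream_id: str, input_format: str) -> object:
        state = self._streams.get(stream_id)
        if state is not None and state.input_format != input_format:
            # The call leg changed format: close and start a fresh session.
            log.info("stream %s: format %s -> %s, reopening",
                     stream_id, state.input_format, input_format)
            self._close(state.session)
            state = None
        if state is None:
            log.info("stream %s: opening STT session (%s)", stream_id, input_format)
            state = StreamState(input_format, self._open(stream_id, input_format))
            self._streams[stream_id] = state
        return state.session

    def end(self, stream_id: str) -> None:
        # Call ended: close this stream's session, if any.
        state = self._streams.pop(stream_id, None)
        if state is not None:
            log.info("stream %s: closing STT session", stream_id)
            self._close(state.session)
```

Keying everything by the provider's stream ID keeps sessions isolated per caller and makes every log line traceable back to the provider's own records.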

Related Pages

  • Speech-to-Text WebSocket: real-time transcription with VAD and flush.
  • Text-to-Speech WebSocket: stream generated audio back to your telephony bridge.