Basic Streaming Usage
Setup Parameters
- model_name: The STT model to use (default: "default")
- input_format: Audio format of the input data (supported: "pcm", "wav", "opus")
"pcm" input format, the audio must adhere to the following
specifications:
- Sample Rate: 24000 Hz (24kHz)
- Format: PCM (Pulse Code Modulation)
- Bit Depth: 16-bit signed integer
- Channels: Single channel (mono)
- Chunk Size: Recommended 1920 samples per chunk (80ms at 24kHz)
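The PCM requirements above can be met with a small helper. The sketch below is illustrative (the chunking function and its name are not part of the STT API); it converts float audio to 16-bit signed mono PCM and yields chunks of the recommended 1920 samples:

```python
import numpy as np

SAMPLE_RATE = 24_000   # 24 kHz, as required for "pcm" input
CHUNK_SAMPLES = 1_920  # 80 ms at 24 kHz (recommended chunk size)

def pcm_chunks(samples: np.ndarray):
    """Yield 16-bit signed little-endian mono PCM chunks.

    `samples` is a float array in [-1.0, 1.0]. The final partial chunk
    is zero-padded so every chunk is exactly CHUNK_SAMPLES long.
    """
    pcm16 = (np.clip(samples, -1.0, 1.0) * 32767).astype("<i2")
    for start in range(0, len(pcm16), CHUNK_SAMPLES):
        chunk = pcm16[start:start + CHUNK_SAMPLES]
        if len(chunk) < CHUNK_SAMPLES:
            chunk = np.pad(chunk, (0, CHUNK_SAMPLES - len(chunk)))
        yield chunk.tobytes()

# One second of audio yields ceil(24000 / 1920) = 13 chunks,
# each 1920 samples * 2 bytes = 3840 bytes.
chunks = list(pcm_chunks(np.zeros(SAMPLE_RATE)))
```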
"wav" input format, the audio must be a valid WAV file using
PCM data (so AudioFormat = 1 in the WAV header). Supported bits per sample
are 16, 24 and 32 bits.
When using "opus" input format, the audio must be some ogg wrapped opus data
stream.
Message Types
The STT stream returns different types of messages:
- Text Messages (text): Contain transcription results together with timestamps.
- VAD Messages (step): Provide Voice Activity Detection information to determine when the speaker has finished speaking.
- Flushed Messages (flushed): Confirm that buffered audio has been processed (see Flushing below).
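A receive loop typically dispatches on the message type. The sketch below assumes messages arrive as dictionaries with a "type" field matching the names above; the exact payload fields ("text", "start_time", "flush_id") are illustrative assumptions, not a documented schema:

```python
def handle_message(msg: dict) -> str:
    """Dispatch an STT stream message by its "type" field.

    Payload field names here are assumptions for illustration.
    """
    kind = msg.get("type")
    if kind == "text":
        # Transcription result with a timestamp
        return f'text: {msg["text"]} @ {msg["start_time"]:.2f}s'
    if kind == "step":
        # Voice Activity Detection update
        return "vad: step update"
    if kind == "flushed":
        # Buffered audio up to this point has been processed
        return f'flushed: id={msg["flush_id"]}'
    return f"unknown message type: {kind}"
```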
Flushing
You can force the server to immediately process all buffered audio using send_flush(). This is useful when you need transcription results without waiting for the normal processing cadence, for example at a natural pause in conversation or before switching speakers.
The server will process all outstanding audio, return any pending text results, and then respond with a flushed message containing the same flush_id you provided.
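Since the server echoes back the flush_id you provided, the client needs to correlate flushed messages with its own requests. The helper below is an illustrative sketch of that bookkeeping, not part of the STT client API:

```python
import itertools

class FlushTracker:
    """Tracks outstanding flush requests by flush_id.

    Illustrative helper: pass next_flush_id() to send_flush(), then
    call on_flushed() with each incoming flushed message.
    """
    def __init__(self):
        self._ids = itertools.count(1)
        self.pending = set()

    def next_flush_id(self) -> int:
        fid = next(self._ids)
        self.pending.add(fid)
        return fid

    def on_flushed(self, msg: dict) -> bool:
        """Return True if the flushed message matches a pending request."""
        fid = msg.get("flush_id")
        if fid in self.pending:
            self.pending.remove(fid)
            return True
        return False
```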
Advanced Options
Some models support advanced options that can be passed using the json_config
parameter. In the Python API, this parameter is passed as a dictionary mapping
strings to values (either floats or strings).
This parameter can be used to control:
- Stability of the generated text via the text_temperature parameter.
- Expected language via the language parameter.
- Delay before the text is generated, in audio frames, via the delay_in_frames parameter. Supported values are: 7, 8, 10, 12, 14, 16, 20, 24, 36, 48.
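A json_config dictionary might look like the sketch below. The parameter names follow the list above as rendered in this document, and the value choices are illustrative assumptions:

```python
# Illustrative json_config for the advanced options above;
# parameter names and values are assumptions, check your model's docs.
json_config = {
    "text_temperature": 0.0,  # lower values favor more stable transcription
    "language": "en",         # expected language of the audio
    "delay_in_frames": 12,    # must be one of the supported values
}

SUPPORTED_DELAYS = {7, 8, 10, 12, 14, 16, 20, 24, 36, 48}
assert json_config["delay_in_frames"] in SUPPORTED_DELAYS
```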