Kataleptic
§ Docs Updated 2026-06-10

Realtime voice

Speech-to-speech over WebSocket, wire-compatible with the OpenAI Realtime API. Point an unmodified OpenAI realtime client at api.kataleptic.com and it works — GA dialect by default, beta dialect auto-detected. Three tiers behind one endpoint, selected by the model id.

Endpoint: wss://api.kataleptic.com/v1/realtime?model=<id>
Auth: Bearer dg_... · ?token= · subprotocol
Protocol: OpenAI Realtime (GA; beta auto-detected)
Audio: PCM16 @ 16/24 kHz · G.711 on Azure tiers

Quickstart

Open a WebSocket, configure the session, send a bare response.create to make the agent speak first, then stream microphone audio in and play audio deltas out. That is the whole loop.

// Browser / Cloudflare Workers — auth via subprotocol
const ws = new WebSocket(
  "wss://api.kataleptic.com/v1/realtime?model=kataleptic-realtime",
  ["realtime", "openai-insecure-api-key." + KATALEPTIC_API_KEY],
);

ws.onopen = () => {
  ws.send(JSON.stringify({
    type: "session.update",
    session: {
      instructions: "You are the booking agent for a small hotel. " +
                    "Open by greeting the caller and asking how you can help.",
      turn_detection: { type: "server_vad", silence_duration_ms: 400 },
    },
  }));
  // Bare response.create → the agent speaks its opening line.
  ws.send(JSON.stringify({ type: "response.create" }));
};

ws.onmessage = (e) => {
  const ev = JSON.parse(e.data);
  if (ev.type === "response.output_audio.delta") {
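    // playPcm16 is your own playback routine (not defined here);
    // atob yields a binary string holding the raw PCM16 bytes.
    playPcm16(atob(ev.delta));            // PCM16 mono @ 24 kHz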
  } else if (ev.type === "response.output_audio_transcript.done") {
    console.log("agent said:", ev.transcript);
  }
};

// Stream microphone audio as base64-encoded PCM16:
function sendChunk(base64Pcm16) {
  ws.send(JSON.stringify({ type: "input_audio_buffer.append", audio: base64Pcm16 }));
}
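
The quickstart leaves the audio conversion on both sides to the reader. Below is a minimal sketch of the two conversions, assuming Web Audio style Float32 samples in [-1, 1] and little-endian PCM16 on the wire. The function names are hypothetical and not part of any Kataleptic SDK.

```javascript
// Hypothetical helpers (assumed names), shown only to illustrate the byte layout.

// Float32 samples in [-1, 1] -> base64-encoded little-endian PCM16.
function floatToBase64Pcm16(samples) {
  const bytes = new Uint8Array(samples.length * 2);
  const view = new DataView(bytes.buffer);
  for (let i = 0; i < samples.length; i++) {
    const s = Math.max(-1, Math.min(1, samples[i]));   // clamp so loud input cannot wrap around
    view.setInt16(i * 2, Math.round(s < 0 ? s * 0x8000 : s * 0x7fff), true);
  }
  let binary = "";
  for (let i = 0; i < bytes.length; i++) binary += String.fromCharCode(bytes[i]);
  return btoa(binary);
}

// base64-encoded PCM16 (for example ev.delta) -> Float32 samples in [-1, 1).
function base64Pcm16ToFloat(base64) {
  const binary = atob(base64);
  const bytes = new Uint8Array(binary.length);
  for (let i = 0; i < binary.length; i++) bytes[i] = binary.charCodeAt(i);
  const view = new DataView(bytes.buffer);
  const out = new Float32Array(bytes.length >> 1);
  for (let i = 0; i < out.length; i++) out[i] = view.getInt16(i * 2, true) / 0x8000;
  return out;
}
```

The encoder's output can be passed to sendChunk as-is. The decoder's output can be copied into an AudioBuffer created at the session's sample rate.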

Already on the OpenAI SDK? Unmodified OpenAI realtime clients work as-is — change the host to api.kataleptic.com and keep your code. We speak the GA dialect by default and switch to the beta dialect automatically when your client sends the OpenAI-Beta: realtime=v1 header or the openai-beta.realtime-v1 subprotocol.

Authentication

Three ways to present your dg_… key, in order of preference:

  • Header: send Authorization: Bearer dg_…. Use this from servers.
  • Query parameter: ?token=dg_… appended to the WebSocket URL, for clients that cannot set headers.
  • Subprotocol: openai-insecure-api-key.dg_… in the WebSocket subprotocol list, the same convention OpenAI uses for browser and Workers clients. As the name says: only use this with short-lived keys you are comfortable exposing to the client.
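
The three options differ only in where the key travels. The sketch below builds the URL, subprotocol list, and headers for each one. The helper name and its return shape are assumptions made for illustration. The endpoint, the ?token= parameter, and the subprotocol strings are taken from this page.

```javascript
// Hypothetical helper: returns what to hand to your WebSocket constructor.
function realtimeConnection({ apiKey, model = "kataleptic-realtime", method = "header" }) {
  const url = new URL("wss://api.kataleptic.com/v1/realtime");
  url.searchParams.set("model", model);
  const conn = { url: "", protocols: [], headers: {} };

  if (method === "header") {
    conn.headers.Authorization = `Bearer ${apiKey}`;          // servers
  } else if (method === "query") {
    url.searchParams.set("token", apiKey);                    // clients that cannot set headers
  } else if (method === "subprotocol") {
    conn.protocols = ["realtime", `openai-insecure-api-key.${apiKey}`]; // browsers, Workers
  } else {
    throw new Error(`unknown auth method: ${method}`);
  }

  conn.url = url.toString();
  return conn;
}
```

In a browser the result maps onto new WebSocket(conn.url, conn.protocols). From Node with the ws package (an assumption about your stack) the headers go in the options argument: new WebSocket(conn.url, conn.protocols, { headers: conn.headers }).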

The three tiers

One endpoint, three engines. The model id in ?model= selects the engine; everything else about the protocol stays the same.

| Model id | Engine | First audio | Residency | Transcripts | Typical price |
|---|---|---|---|---|---|
| kataleptic-realtime | Cascade: Whisper STT → chat model → Piper TTS | ~250 ms | EU, our own fleet | Exact | ≈$0.0133/min |
| kataleptic-realtime-hd | Azure Voice Live (Sweden Central) | ~1.2 s | EU | Exact | ≈$0.03/min |
| gpt-realtime-2 | Native speech-to-speech | ~1.0 s | Global routing, not EU-pinned | Model approximation | ≈$0.07/min |

kataleptic-realtime — the default

A fully EU-resident cascade on our own fleet: streaming Whisper speech-to-text, a catalogue chat model in the middle, Piper text-to-speech on the way out. The brain is swappable per session — pass any catalogue chat model id in ?model= (default mistral-nemo-12b) and the cascade uses it. Ten languages are auto-detected per utterance — EN, DE, FR, ES, NL, SV, DA, IT, FI, RU — and the TTS voice follows the detected language. Server-side VAD with barge-in; speech recognition is noise-gated (Silero VAD plus no-speech and language-probability thresholds), so breathing and background noise do not become turns.

kataleptic-realtime-hd — premium voices

The same WebSocket, served by Azure Voice Live in Sweden Central: 600+ HD neural voices, deep noise suppression, echo cancellation, and semantic turn detection (the model judges whether the caller is done, not just the silence timer). EU-resident processing, exact transcripts.

gpt-realtime-2 — native speech-to-speech

No cascade — one model hears audio and speaks audio. Best prosody and expressiveness of the three; it responds to tone, hesitation, and emphasis, not just words. The trade-offs are real and listed in caveats: inference is globally routed (not EU-pinned) and transcripts are the model's own approximation of what was said.

session.update reference

Send session.update as your first message to configure the conversation. The supported subset:

| Field | Type | What it does |
|---|---|---|
| instructions | string | The system prompt. Persona, opening line, guardrails. |
| voice | string | Voice selection. On the default tier the voice follows the detected language; on the Azure tiers pick from their voice catalogues. |
| turn_detection.type | "server_vad" | Server-side voice activity detection. The server decides when the caller's turn ends. |
| turn_detection.threshold | number | VAD sensitivity. Higher = needs louder/clearer speech to open a turn. |
| turn_detection.prefix_padding_ms | number | Audio retained from before speech onset, so first syllables are not clipped. |
| turn_detection.silence_duration_ms | number | Trailing silence that ends the turn. Lower = snappier, more interruptions. |
| turn_detection.create_response | boolean | Auto-respond when a turn ends. Set false to drive responses yourself with response.create. |
| turn_detection.interrupt_response | boolean | Barge-in: caller speech cancels the agent's in-flight reply. |
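
As an illustration of how these fields fit together, here is a sketch of a builder that produces a complete session.update message. The helper name is hypothetical, and the numeric defaults (0.5, 300, 400) are illustrative values chosen for the example, not documented server defaults.

```javascript
// Hypothetical helper: fills every turn_detection field from the table above.
function buildSessionUpdate({ instructions, voice, turnDetection = {} }) {
  const session = {
    instructions,
    turn_detection: {
      type: "server_vad",
      threshold: 0.5,             // illustrative value, not a documented default
      prefix_padding_ms: 300,     // illustrative value
      silence_duration_ms: 400,   // same value the quickstart uses
      create_response: true,      // auto-respond when a turn ends
      interrupt_response: true,   // allow barge-in
      ...turnDetection,           // caller overrides win
    },
  };
  if (voice !== undefined) session.voice = voice;
  return { type: "session.update", session };
}
```

A call such as buildSessionUpdate({ instructions: "...", turnDetection: { create_response: false } }) yields a session in which you drive replies yourself with response.create. Send the result with ws.send(JSON.stringify(...)).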

Protocol subset

Client events we accept, on every tier:

  • session.update — configure instructions, voice, turn detection (see above).
  • input_audio_buffer.append / .commit / .clear — stream caller audio; commit manually if you run your own VAD.
  • conversation.item.create / .delete / .truncate — edit conversation history, including previous_item_id placement and root insertion.
  • response.create / response.cancel — request or cancel an agent reply.
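
For clients that run their own VAD, one caller turn reduces to a fixed event sequence: append the audio, commit the buffer, request a reply. The sketch below only builds that sequence. It assumes the session is configured so the server does not also respond on its own (for example create_response set to false). This page does not say whether server VAD can be switched off entirely, so that part is left open. The helper name is hypothetical.

```javascript
// Hypothetical helper: events for one caller turn whose end is decided client-side.
function manualTurnEvents(base64Chunks) {
  const events = base64Chunks.map((audio) => ({
    type: "input_audio_buffer.append",
    audio,                                               // base64-encoded PCM16
  }));
  events.push({ type: "input_audio_buffer.commit" });    // close the caller's turn
  events.push({ type: "response.create" });              // ask the agent to reply
  return events;
}
```

In a live bridge the append events would be sent as audio arrives and only the last two at end of turn. Each event goes out through ws.send(JSON.stringify(event)).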

Audio is PCM16 at 16 or 24 kHz in both directions on all tiers. The two Azure tiers additionally accept G.711 for telephony — see below.

Transcripts & call logging

Both directions of the conversation arrive as text events, which is all you need to build a call log:

  • Caller side: conversation.item.input_audio_transcription.completed fires once per caller utterance with the final transcript.
  • Agent side: response.output_audio_transcript.delta streams the agent's words as it speaks; response.output_audio_transcript.done carries the full utterance.

gpt-realtime-2 only: caller transcripts are on by default (we enable input_audio_transcription with whisper-1 for you; override or disable it in session.update). The agent-side transcript is the model's approximation of its own speech, not an exact STT transcript; if your call logs have compliance weight, use the default or HD tier. On the default tier (kataleptic-realtime), transcription events also carry language and language_probability fields.
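
A call log built from these two events can be sketched as a small reducer. The helper and entry shape are hypothetical. Reading the caller's text from ev.transcript follows the OpenAI Realtime event shape and is an assumption here. The language fields are copied only when present, since this page describes them for the default tier only.

```javascript
// Hypothetical call-log collector; feed it every parsed server event.
function createCallLog() {
  const entries = [];
  return {
    entries,
    handle(ev) {
      if (ev.type === "conversation.item.input_audio_transcription.completed") {
        const entry = { speaker: "caller", text: ev.transcript };
        // Only present on tiers that report per-utterance language detection.
        if (ev.language !== undefined) entry.language = ev.language;
        if (ev.language_probability !== undefined) {
          entry.languageProbability = ev.language_probability;
        }
        entries.push(entry);
      } else if (ev.type === "response.output_audio_transcript.done") {
        entries.push({ speaker: "agent", text: ev.transcript });
      }
      // Deltas and all other events are ignored: the .done event has the full text.
    },
  };
}
```

Wire it into the socket with ws.onmessage = (e) => log.handle(JSON.parse(e.data)) alongside your audio handling.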

Greeting pattern

Phone agents should speak first. Put the opening line in instructions, then send a bare response.create — no conversation items needed:

{ "type": "session.update",
  "session": { "instructions": "Greet the caller: 'Grüß Gott, Hotel Sacher reception.' Then assist." } }

{ "type": "response.create" }

The agent speaks the greeting per its instructions, and the normal turn-taking loop begins from there.

Telephony / G.711

SIP trunks and most PSTN gateways hand you G.711. On kataleptic-realtime-hd and gpt-realtime-2 you can pass it straight through without transcoding:

  • Beta-dialect flat fields: "input_audio_format": "g711_ulaw" (or "g711_alaw"), same for output.
  • GA-dialect format objects: {"type": "audio/pcmu"} / {"type": "audio/pcma"}.

The default kataleptic-realtime tier is PCM16-only — transcode at your media gateway if you bridge it to a trunk.
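
The two spellings can be sketched side by side. The beta field names input_audio_format and output_audio_format and the GA format objects come from the list above. Nesting the GA objects under session.audio.input.format and session.audio.output.format mirrors the OpenAI GA schema and is an assumption about what this endpoint accepts. The helper name is hypothetical.

```javascript
// Hypothetical helper: a session.update that switches both directions to G.711.
function g711SessionUpdate(dialect, law = "ulaw") {
  if (law !== "ulaw" && law !== "alaw") throw new Error(`unknown G.711 variant: ${law}`);

  if (dialect === "beta") {
    const fmt = `g711_${law}`;                       // "g711_ulaw" or "g711_alaw"
    return {
      type: "session.update",
      session: { input_audio_format: fmt, output_audio_format: fmt },
    };
  }
  if (dialect === "ga") {
    const type = law === "ulaw" ? "audio/pcmu" : "audio/pcma";
    return {
      type: "session.update",
      // Assumed nesting, following the OpenAI GA session schema.
      session: { audio: { input: { format: { type } }, output: { format: { type } } } },
    };
  }
  throw new Error(`unknown dialect: ${dialect}`);
}
```

Use it only on kataleptic-realtime-hd and gpt-realtime-2, since the default tier is PCM16-only.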

Caveats per tier

kataleptic-realtime

  • Cascade voices are functional, not studio-grade — if voice quality is the product, use HD.
  • One voice per language; the voice field has limited effect because the voice follows the detected language.

kataleptic-realtime-hd

  • First audio ~1.2 s — noticeably slower to open than the default tier's ~250 ms.

gpt-realtime-2

  • Not EU-pinned — inference uses global routing. Do not put it behind a residency commitment.
  • Agent transcripts are model approximations, not exact STT output (caller transcripts use whisper-1, on by default).

Function calling

All three tiers support OpenAI Realtime function calling. Define tools in session.update (flat realtime shape: {"type": "function", "name", "description", "parameters"}); when the model decides to call one you receive response.function_call_arguments.delta events, a final response.function_call_arguments.done with the JSON arguments, and a function_call item in response.done. Send the result back as a conversation.item.create with {"type": "function_call_output", "call_id", "output"} followed by response.create.

On the default tier the cascade brain issues the tool call. Small models occasionally write a call as prose instead of invoking it; the server strips text that exactly matches a defined tool-call pattern from the spoken audio, so the agent never says "end_call()" aloud.
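
The round trip described above can be sketched as a tool definition plus a handler for response.done. The tool check_availability, the helper name, and the synchronous tool functions are all hypothetical. Reading function_call items from ev.response.output follows the OpenAI Realtime event shape and is an assumption here.

```javascript
// Example tool in the flat realtime shape (hypothetical tool).
const toolsSessionUpdate = {
  type: "session.update",
  session: {
    tools: [{
      type: "function",
      name: "check_availability",
      description: "Check room availability for a date.",
      parameters: {
        type: "object",
        properties: { date: { type: "string" } },
        required: ["date"],
      },
    }],
  },
};

// Runs every function_call item in a response.done event and sends the results back.
// `tools` maps tool names to synchronous functions; `send` takes an event object.
// Returns the number of calls handled.
function handleFunctionCalls(ev, tools, send) {
  if (ev.type !== "response.done") return 0;
  const calls = (ev.response?.output ?? []).filter((item) => item.type === "function_call");
  for (const call of calls) {
    let output;
    try {
      const fn = tools[call.name];
      if (!fn) throw new Error(`unknown tool: ${call.name}`);
      output = fn(JSON.parse(call.arguments || "{}"));
    } catch (err) {
      output = { error: String(err.message ?? err) };   // let the model see the failure
    }
    send({
      type: "conversation.item.create",
      item: { type: "function_call_output", call_id: call.call_id, output: JSON.stringify(output) },
    });
  }
  if (calls.length > 0) send({ type: "response.create" });   // let the agent speak the result
  return calls.length;
}
```

With a live socket, pass (ev) => ws.send(JSON.stringify(ev)) as send. Tools that need network access would require an asynchronous variant of the loop.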

Session limits & lifecycle

  • Max session duration: 60 minutes. One minute before the cutoff the server emits a vendor-extension event {"type": "session.expiring", "reason": "max_session_duration", "expires_in_seconds": …} so bridges can reconnect gracefully. Clients that ignore unknown events lose nothing.
  • Idle timeout: 5 minutes without any WebSocket message (continuous audio streaming counts as activity).
  • Server deploys can terminate live sessions; production bridges should reconnect on unexpected close and re-send session.update.
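
A bridge that follows these rules might look like the sketch below: it re-sends session.update on every connect, opens a replacement socket when session.expiring arrives, and reconnects after an unexpected close. The supervisor, its option names, and the one-second retry delay are assumptions made for illustration. connect is any function that returns a WebSocket-like object.

```javascript
// Hypothetical session supervisor.
function superviseSession({
  connect,                    // () => WebSocket-like object
  sessionUpdate,              // the session.update event to (re-)send
  onEvent = () => {},         // receives every other parsed server event
  schedule = setTimeout,      // injectable for tests
  retryDelayMs = 1000,        // illustrative value
}) {
  let current = null;
  let stopped = false;

  function open() {
    if (stopped) return;
    const ws = connect();
    current = ws;
    ws.onopen = () => ws.send(JSON.stringify(sessionUpdate));
    ws.onmessage = (e) => {
      const ev = JSON.parse(e.data);
      if (ev.type === "session.expiring" && ws === current) {
        open();               // the replacement becomes `current`...
        ws.close();           // ...so closing the old socket triggers no retry
      } else {
        onEvent(ev, ws);
      }
    };
    ws.onclose = () => {
      // Unexpected close (deploy, network drop): reconnect after a short delay.
      if (!stopped && ws === current) schedule(open, retryDelayMs);
    };
  }

  open();
  return {
    stop() {
      stopped = true;
      if (current) current.close();
    },
  };
}
```

This version swaps sockets as soon as the expiry warning arrives. A production bridge would more likely wait for a pause in the conversation within the remaining minute.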

Voice catalog per tier

  • kataleptic-realtime — voice follows the detected caller language automatically across all ten languages. Before the first caller utterance, the initial voice seeds from input_audio_transcription.language when set, or from the language of your instructions — so instruction-driven greetings come out in the right voice with zero configuration. The language field is a seed and STT-accuracy hint, not a cage: once real speech arrives, per-utterance detection overrides it — in the reported language field, the voice, and what the model is told — even when it contradicts the seed. To pin a voice explicitly, pass a Piper id as voice: en_US-lessac-medium, de_DE-thorsten-medium, fr_FR-siwis-medium, es_ES-sharvard-medium, nl_NL-mls-medium, sv_SE-nst-medium, da_DK-talesyntese-medium, it_IT-paola-medium, fi_FI-harri-medium, ru_RU-irina-medium. OpenAI voice names are accepted and ignored in favor of language-matching.
  • kataleptic-realtime-hd — any Azure neural voice name passes through (e.g. de-DE-SeraphinaMultilingualNeural, 600+ voices); OpenAI voice names (alloy, marin, …) map to Azure multilingual voices; Piper ids map to the closest Azure voice. Default is a multilingual voice, so language-follow works with no configuration.
  • gpt-realtime-2 — OpenAI voices only (marin, cedar, alloy, …); non-OpenAI names coerce to marin. Voices are natively multilingual.

The full machine-readable catalog (including the live per-language Piper map) is served at GET /v1/realtime/voices, no auth required. One namespace, three tiers: OpenAI names work everywhere; engine-native names (Piper ids, Azure voice names) work on their own tier and degrade gracefully elsewhere.

Unknown transcription.model values return an error event; supported values are whisper-1, gpt-4o-transcribe, gpt-4o-mini-transcribe (→ turbo) and whisper-large-v3 (full model, ~+110 ms, better on noisy audio).

Choosing the cascade brain

  • mistral-nemo-12b (default) — fastest replies (~0.3 s first audio), solid small-talk and form-filling; weaker at multi-step reasoning (dates, arithmetic) and occasionally imperfect language adherence on long prompts.
  • llama-3.3-70b — strong reasoning and reliable multilingual replies at ~1–1.4 s first audio. Recommended for production receptionists that must reason about schedules.
  • gpt-5.4-mini / other catalogue models — pick any chat model via ?model=; latency is dominated by that model's time-to-first-token.

Pricing & billing

  • kataleptic-realtime — $0.0033/min audio in + $0.01/min audio out, plus the chat model's tokens at its catalogue rate. ≈$0.0133/min all-in with the default brain.
  • kataleptic-realtime-hd — billed per token at Azure Voice Live rates with a service margin; ≈$0.03/min typical.
  • gpt-realtime-2 — billed per text + audio token; ≈$0.07/min typical.
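
For budgeting, the typical per-minute figures above can be turned into a rough estimator. This is a sketch: the helper names are hypothetical, and the HD and gpt-realtime-2 rates are the typical values quoted here, not a tariff, since both tiers are actually billed per token.

```javascript
// Typical per-minute rates in USD, taken from the list above.
const TYPICAL_RATE_PER_MIN = {
  "kataleptic-realtime": 0.0033 + 0.01,   // audio in + audio out = 0.0133
  "kataleptic-realtime-hd": 0.03,         // typical; billed per token
  "gpt-realtime-2": 0.07,                 // typical; billed per token
};

// Estimated cost in USD for a call of the given length.
function estimateCost(model, minutes) {
  const rate = TYPICAL_RATE_PER_MIN[model];
  if (rate === undefined) throw new Error(`no typical rate for model: ${model}`);
  return rate * minutes;
}

// Conversation minutes a budget buys at the typical rate.
function minutesForBudget(model, usd) {
  const rate = TYPICAL_RATE_PER_MIN[model];
  if (rate === undefined) throw new Error(`no typical rate for model: ${model}`);
  return usd / rate;
}
```

At these rates, $5 of credit on the default tier works out to about 376 minutes, a little over six hours.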

Usage shows up on your key under the model ids kataleptic-realtime, kataleptic-realtime-hd and gpt-realtime-2 — same GET /v1/auth/key surface as everything else.

Voice minutes are cheap to try: the $5 of free signup credit buys roughly six hours of conversation on the default tier. Get a key and say hello to it.