Your AI assistant can operate in three distinct modes. Each mode determines how the caller’s speech is understood and how the assistant’s reply is generated. Choosing the right mode affects response latency, naturalness, and voice options.

Pipeline

Label in UI: Pipeline
How it works: Speech-to-Text → LLM → Text-to-Speech
Latency: ~800–1500 ms (varies by language and model)
Best for: Complex reasoning, dynamic prompts, multi-sentence replies
Pipeline mode transcribes the caller’s words into text, runs that text through the language model, then converts the response back to audio. It offers maximum flexibility:
  • Supports all voices in the library, including custom-cloned voices.
  • Handles long-form answers or paragraph-style responses well.
  • Allows the LLM to inject variables and reference earlier conversation context.
Choose Pipeline when:
  1. You need rich, multi-sentence answers (for example, support queries or detailed explanations).
  2. The assistant must reason over structured data or complex prompts.
  3. You need full control over the spoken voice, including cloned or brand voices.
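The three stages above can be sketched in a few lines. The function names below are illustrative stubs, not the platform's real SDK; a production pipeline would call actual STT, LLM, and TTS services at each step.

```python
# Minimal sketch of Pipeline mode: STT -> LLM -> TTS.
# All three stage functions are stand-ins for real service calls.

def transcribe(audio: bytes) -> str:
    """Speech-to-text stage (stub: pretend the audio said this)."""
    return "What are your opening hours?"

def generate_reply(text: str, context: list[str]) -> str:
    """LLM stage (stub: canned reply; a real model would use `context`)."""
    return "We are open 9am to 5pm, Monday to Friday."

def synthesize(text: str, voice: str) -> bytes:
    """Text-to-speech stage (stub: returns fake audio bytes)."""
    return f"[{voice}]{text}".encode()

def pipeline_turn(audio_in: bytes, context: list[str], voice: str) -> bytes:
    text = transcribe(audio_in)            # 1. caller speech -> text
    reply = generate_reply(text, context)  # 2. text -> LLM reply
    context.append(text)                   # keep conversation context
    return synthesize(reply, voice)        # 3. reply -> audio
```

Because the reply exists as text before synthesis, any voice in the library (including clones) can render it, which is why Pipeline keeps full voice control.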

Speech-to-speech (multimodal)

Label in UI: Speech-to-speech
How it works: Direct speech-to-speech generation — no intermediate text
Latency: ~300–600 ms (ultra low)
Best for: Natural back-and-forth, short and reactive replies
Speech-to-speech skips the separate transcription and TTS steps. A multimodal model listens and speaks directly, producing a more conversational flow:
  • Fast turn-taking — callers experience near-instant responses.
  • Generates more expressive prosody natively (intonation, fillers).
  • Currently supports a limited voice set, though more are added regularly.
Choose speech-to-speech when:
  1. The conversation needs to feel snappy (sales calls, booking confirmations).
  2. Your replies are generally short sentences or quick acknowledgements.
  3. You’re comfortable with the system-provided voice options.
Speech-to-speech is evolving rapidly. If you need a custom cloned voice with low latency, try Dualplex instead.
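The single-hop architecture is what drives the latency difference: one model call replaces the three-stage chain. The sketch below uses a placeholder function, not a real model API.

```python
# Sketch of speech-to-speech mode: one multimodal call, no text hop.
# `multimodal_respond` is a placeholder for a real speech-to-speech model.

def multimodal_respond(audio_in: bytes, voice: str) -> bytes:
    """Stub: a real model would listen and speak in a single pass."""
    return b"[" + voice.encode() + b"]reply-audio"

def s2s_turn(audio_in: bytes, voice: str = "system-default") -> bytes:
    # Single hop: audio in -> audio out. No transcript is produced,
    # which lowers latency but limits voice choice to what the
    # multimodal model natively supports.
    return multimodal_respond(audio_in, voice)
```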

Dualplex (Beta)

Label in UI: Dualplex
How it works: Multimodal STT + LLM (speech-to-speech) with ElevenLabs TTS output
Latency: Low (varies by voice and model)
Best for: Fast, natural replies with premium or cloned voices
Dualplex blends the responsiveness of speech-to-speech with premium ElevenLabs voices and voice cloning. The assistant uses the multimodal model to understand the caller and plan the reply, then renders the final speech through ElevenLabs for consistent, high-fidelity output.
  • Near-instant turn-taking, similar to speech-to-speech.
  • Access to the full ElevenLabs voice library, including custom-cloned voices.
  • Great for short to medium replies with expressive delivery.
  • Recommended default for most use cases today.
Choose Dualplex when:
  1. You want fast back-and-forth but need a branded or cloned voice.
  2. You want expressive delivery without giving up precise voice choice.
  3. You’re comfortable using a Beta feature.
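Dualplex's two-stage flow sits between the other modes: understand and plan with the multimodal model, then render with premium TTS. The function names below are illustrative stubs, not the actual product or ElevenLabs API.

```python
# Sketch of Dualplex: a multimodal model plans the reply from audio,
# then an ElevenLabs-style TTS stage renders it in a chosen voice.

def understand_and_plan(audio_in: bytes) -> str:
    """Multimodal stage (stub): listen to the caller, draft a short reply."""
    return "Sure, your booking is confirmed for 3pm."

def render_speech(text: str, voice_id: str) -> bytes:
    """Premium TTS stage (stub): render the reply in a cloned/brand voice."""
    return f"[{voice_id}]{text}".encode()

def dualplex_turn(audio_in: bytes, voice_id: str) -> bytes:
    reply = understand_and_plan(audio_in)  # fast multimodal understanding
    return render_speech(reply, voice_id)  # high-fidelity voice output
```

The design trade-off: one extra synthesis hop compared to pure speech-to-speech, in exchange for the full voice library.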

Switching modes

Select the mode for each assistant in Assistant → Settings → Voice Engine. Test all three modes on real calls to find the best balance of speed and quality for your use case.
Record a call in each mode and compare the caller's perceived latency and engagement to decide which fits your flow.
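Per-assistant mode selection could be represented as a simple settings map. The field names and values below are illustrative assumptions, not the platform's real configuration schema.

```python
# Hypothetical per-assistant settings mirroring the
# Assistant -> Settings -> Voice Engine choice in the UI.
# Keys and values are illustrative, not the real config schema.

VALID_MODES = {"pipeline", "speech-to-speech", "dualplex"}

ASSISTANT_SETTINGS = {
    "support-line": {"voice_engine": "pipeline", "voice": "brand-clone-v2"},
    "booking-bot":  {"voice_engine": "speech-to-speech"},
    "sales-line":   {"voice_engine": "dualplex", "voice": "brand-clone-v2"},
}

def set_mode(assistant: str, mode: str) -> None:
    """Switch an assistant's voice engine, rejecting unknown modes."""
    if mode not in VALID_MODES:
        raise ValueError(f"unknown voice engine: {mode}")
    ASSISTANT_SETTINGS.setdefault(assistant, {})["voice_engine"] = mode
```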

Mode comparison

Feature              | Pipeline      | Speech-to-speech | Dualplex
---------------------|---------------|------------------|----------------
Latency              | ~800–1500 ms  | ~300–600 ms      | Low
Custom/cloned voices | Yes           | No               | Yes (ElevenLabs)
Long-form answers    | Excellent     | Limited          | Good
Turn-taking speed    | Slower        | Fastest          | Fast
Status               | Stable        | Stable           | Beta