Your AI assistant can operate in three distinct modes. Each mode determines how the caller’s speech is understood and how the assistant’s reply is generated. Choosing the right mode affects response latency, naturalness, and voice options.
## Pipeline
| | |
|---|---|
| Label in UI | Pipeline |
| How it works | Speech-to-Text → LLM → Text-to-Speech |
| Latency | ~800–1500 ms (varies by language and model) |
| Best for | Complex reasoning, dynamic prompts, multi-sentence replies |
Pipeline mode transcribes the caller’s words into text, runs that text through the language model, then converts the response back to audio. It offers maximum flexibility:
- Supports all voices in the library, including custom-cloned voices.
- Handles long-form answers or paragraph-style responses well.
- Allows the LLM to inject variables and reference earlier conversation context.
Choose Pipeline when:
- You need rich, multi-sentence answers (for example, support queries or detailed explanations).
- The assistant must reason over structured data or complex prompts.
- You need full control over the spoken voice, including cloned or brand voices.
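The three sequential stages can be sketched as follows. The functions here are illustrative stand-ins, not a real SDK; a deployment would swap each stub for an actual STT, LLM, and TTS call:

```python
def transcribe(audio: bytes) -> str:
    """Stub STT stage: a real deployment would call a speech-to-text model."""
    return "What are your opening hours?"

def generate_reply(transcript: str, context: list[str]) -> str:
    """Stub LLM stage: a real deployment would prompt a language model,
    optionally injecting variables and earlier conversation context."""
    return f"You asked: '{transcript}'. We are open 9am to 5pm."

def synthesize(text: str) -> bytes:
    """Stub TTS stage: a real deployment would render audio in the chosen voice."""
    return text.encode("utf-8")

def pipeline_turn(caller_audio: bytes, context: list[str]) -> bytes:
    # The stages run strictly in sequence, which is why latency adds up
    # (~800-1500 ms) but also why each step is fully controllable.
    transcript = transcribe(caller_audio)
    reply_text = generate_reply(transcript, context)
    context.extend([transcript, reply_text])  # keep history for later turns
    return synthesize(reply_text)
```

Because the reply exists as text before synthesis, this is also the point where variables, retrieved data, and conversation history can be injected.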
## Speech-to-speech (multimodal)
| | |
|---|---|
| Label in UI | Speech-to-speech |
| How it works | Direct speech-to-speech generation — no intermediate text |
| Latency | ~300–600 ms (ultra low) |
| Best for | Natural back-and-forth, short and reactive replies |
Speech-to-speech skips separate transcription and TTS. A multimodal model listens and speaks directly, producing more conversational flow:
- Fast turn-taking — callers experience near-instant responses.
- Generates more expressive prosody natively (intonation, fillers).
- Currently supports a limited voice set, though more are added regularly.
Choose speech-to-speech when:
- The conversation needs to feel snappy (sales calls, booking confirmations).
- Your replies are generally short sentences or quick acknowledgements.
- You’re comfortable with the system-provided voice options.
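In contrast to the pipeline, the whole turn collapses into a single model call. A minimal sketch, with a stand-in for the multimodal model:

```python
def multimodal_turn(caller_audio: bytes) -> bytes:
    """Stub for a speech-to-speech model: one model consumes the caller's
    audio and emits reply audio directly, with no transcript in between."""
    return b"assistant-reply-audio"

# One call instead of three sequential stages; this is where the
# ~300-600 ms latency advantage comes from.
reply = multimodal_turn(b"fake-caller-audio")
```

The tradeoff is that with no intermediate text, there is no point at which to swap in an arbitrary voice, which is why the voice set is limited.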
Speech-to-speech is evolving rapidly. If you need a custom cloned voice with low latency, try Dualplex instead.
## Dualplex (Beta)
| | |
|---|---|
| Label in UI | Dualplex |
| How it works | Multimodal STT + LLM (speech-to-speech) with ElevenLabs TTS output |
| Latency | Low (varies by voice and model) |
| Best for | Fast, natural replies with premium or cloned voices |
Dualplex blends the responsiveness of speech-to-speech with premium ElevenLabs voices and voice cloning. The assistant uses the multimodal model to understand the caller and plan the reply, then renders the final speech through ElevenLabs for consistent, high-fidelity output.
- Near-instant turn-taking, similar to speech-to-speech.
- Access to the full ElevenLabs voice library, including custom-cloned voices.
- Great for short to medium replies with expressive delivery.
- Recommended default for most use cases today.
Choose Dualplex when:
- You want fast back-and-forth but need a branded or cloned voice.
- You want expressive delivery without giving up precise voice choice.
- You’re comfortable using a Beta feature.
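Dualplex can be sketched as a two-stage hybrid. The function names are illustrative stand-ins, not the actual ElevenLabs API:

```python
def plan_reply(caller_audio: bytes) -> str:
    """Stub multimodal stage: listens to the caller and plans the reply
    as text, without a separate speech-to-text step."""
    return "Sure, I've booked you in for 3pm tomorrow."

def render_with_voice(text: str, voice_id: str) -> bytes:
    """Stub TTS stage: a real deployment would render the text through
    ElevenLabs using a premium or cloned voice ID."""
    return f"[{voice_id}] {text}".encode("utf-8")

def dualplex_turn(caller_audio: bytes, voice_id: str) -> bytes:
    # Understanding and planning happen in one fast multimodal call,
    # while the final audio is rendered in a precisely chosen voice.
    return render_with_voice(plan_reply(caller_audio), voice_id)
```

Because the reply text surfaces briefly before synthesis, the voice choice stays fully in your hands while turn-taking stays close to speech-to-speech speed.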
## Switching modes
Select the mode for each assistant in Assistant → Settings → Voice Engine. Test all three modes on real calls to find the best balance of speed and quality for your use case.
Record a call in each mode and compare the caller's perceived latency and engagement level to decide which fits your flow.
## Mode comparison
| Feature | Pipeline | Speech-to-speech | Dualplex |
|---|---|---|---|
| Latency | ~800–1500 ms | ~300–600 ms | Low |
| Custom/cloned voices | Yes | No | Yes (ElevenLabs) |
| Long-form answers | Excellent | Limited | Good |
| Turn-taking speed | Slower | Fastest | Fast |
| Status | Stable | Stable | Beta |
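The tradeoffs above can be folded into a simple decision helper. This is only a sketch of the guidance on this page, not a product API:

```python
def suggest_mode(needs_cloned_voice: bool,
                 long_form_answers: bool,
                 latency_critical: bool) -> str:
    """Suggest a voice-engine mode from the tradeoffs documented here."""
    if long_form_answers and not latency_critical:
        return "Pipeline"          # best long-form quality, slower turns
    if needs_cloned_voice:
        return "Dualplex"          # fast turns + full voice library (Beta)
    if latency_critical:
        return "Speech-to-speech"  # fastest, but limited voice set
    return "Dualplex"              # recommended default for most use cases
```

For example, a support assistant giving detailed explanations maps to `Pipeline`, while a booking assistant with a branded voice maps to `Dualplex`.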