Your AI assistant can operate in three distinct modes. Each mode determines how the caller’s speech is understood and how the assistant’s reply is generated. Choosing the right mode affects response latency, naturalness, and voice options.
## Pipeline
| | |
|---|---|
| Label in UI | Pipeline |
| How it works | Speech-to-Text → LLM → Text-to-Speech |
| Latency | ~800–1500 ms (varies by language and model) |
| Best for | Complex reasoning, dynamic prompts, multi-sentence replies |
Pipeline mode transcribes the caller’s words into text, runs that text through the language model, then converts the response back to audio. It offers maximum flexibility:
- Supports all voices in the library, including custom-cloned voices.
- Handles long-form answers or paragraph-style responses well.
- Allows the LLM to inject variables and reference earlier conversation context.
Choose Pipeline when:
- You need rich, multi-sentence answers (for example, support queries or detailed explanations).
- The assistant must reason over structured data or complex prompts.
- You need full control over the spoken voice, including cloned or brand voices.
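The three sequential stages can be sketched as follows. The functions here are illustrative stand-ins, not a real SDK; a deployment would swap each stub for an actual STT, LLM, and TTS call:

```python
def transcribe(audio: bytes) -> str:
    """Stub STT stage: a real deployment would call a speech-to-text model."""
    return "What are your opening hours?"

def generate_reply(transcript: str, context: list[str]) -> str:
    """Stub LLM stage: a real deployment would prompt a language model,
    optionally injecting variables and earlier conversation context."""
    return f"You asked: '{transcript}'. We are open 9am to 5pm."

def synthesize(text: str) -> bytes:
    """Stub TTS stage: a real deployment would render audio in the chosen voice."""
    return text.encode("utf-8")

def pipeline_turn(caller_audio: bytes, context: list[str]) -> bytes:
    # The stages run strictly in sequence, which is why latency adds up
    # (~800-1500 ms) but also why each step is fully controllable.
    transcript = transcribe(caller_audio)
    reply_text = generate_reply(transcript, context)
    context.extend([transcript, reply_text])  # keep history for later turns
    return synthesize(reply_text)
```

Because the reply exists as text before synthesis, this is also the point where variables, retrieved data, and conversation history can be injected.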
## Speech-to-speech (multimodal)
| | |
|---|---|
| Label in UI | Speech-to-speech |
| How it works | Direct speech-to-speech generation — no intermediate text |
| Latency | ~300–600 ms (ultra low) |
| Best for | Natural back-and-forth, short and reactive replies |
Speech-to-speech skips separate transcription and TTS. A multimodal model listens and speaks directly, producing more conversational flow:
- Fast turn-taking — callers experience near-instant responses.
- Generates more expressive prosody natively (intonation, fillers).
- Currently supports a limited voice set, though more are added regularly.
Choose speech-to-speech when:
- The conversation needs to feel snappy (sales calls, booking confirmations).
- Your replies are generally short sentences or quick acknowledgements.
- You’re comfortable with the system-provided voice options.
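In contrast to the pipeline, the whole turn collapses into a single model call. A minimal sketch, with a stand-in for the multimodal model:

```python
def multimodal_turn(caller_audio: bytes) -> bytes:
    """Stub for a speech-to-speech model: one model consumes the caller's
    audio and emits reply audio directly, with no transcript in between."""
    return b"assistant-reply-audio"

# One call instead of three sequential stages; this is where the
# ~300-600 ms latency advantage comes from.
reply = multimodal_turn(b"fake-caller-audio")
```

The tradeoff is that with no intermediate text, there is no point at which to swap in an arbitrary voice, which is why the voice set is limited.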
Speech-to-speech is evolving rapidly. If you need a custom cloned voice with low latency, try Dualplex instead.
## Dualplex (Beta)
| | |
|---|---|
| Label in UI | Dualplex |
| How it works | Multimodal STT + LLM (speech-to-speech) with ElevenLabs TTS output |
| Latency | Low (varies by voice and model) |
| Best for | Fast, natural replies with premium or cloned voices |
Dualplex blends the responsiveness of speech-to-speech with premium ElevenLabs voices and voice cloning. The assistant uses the multimodal model to understand the caller and plan the reply, then renders the final speech through ElevenLabs for consistent, high-fidelity output.
- Near-instant turn-taking, similar to speech-to-speech.
- Access to the full ElevenLabs voice library, including custom-cloned voices.
- Great for short to medium replies with expressive delivery.
- Recommended default for most use cases today.
Choose Dualplex when:
- You want fast back-and-forth but need a branded or cloned voice.
- You want expressive delivery without giving up precise voice choice.
- You’re comfortable using a Beta feature.
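Dualplex can be sketched as a two-stage hybrid. The function names are illustrative stand-ins, not the actual ElevenLabs API:

```python
def plan_reply(caller_audio: bytes) -> str:
    """Stub multimodal stage: listens to the caller and plans the reply
    as text, without a separate speech-to-text step."""
    return "Sure, I've booked you in for 3pm tomorrow."

def render_with_voice(text: str, voice_id: str) -> bytes:
    """Stub TTS stage: a real deployment would render the text through
    ElevenLabs using a premium or cloned voice ID."""
    return f"[{voice_id}] {text}".encode("utf-8")

def dualplex_turn(caller_audio: bytes, voice_id: str) -> bytes:
    # Understanding and planning happen in one fast multimodal call,
    # while the final audio is rendered in a precisely chosen voice.
    return render_with_voice(plan_reply(caller_audio), voice_id)
```

Because the reply text surfaces briefly before synthesis, the voice choice stays fully in your hands while turn-taking stays close to speech-to-speech speed.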
## Switching modes
Select the mode for each assistant in Assistant → Settings → Voice Engine. Test all three modes on real calls to find the best balance of speed and quality for your use case.
Record a call in each mode and compare the caller's perceived latency and engagement level to decide which fits your flow.
## Mode comparison
| Feature | Pipeline | Speech-to-speech | Dualplex |
|---|---|---|---|
| Latency | ~800–1500 ms | ~300–600 ms | Low |
| Custom/cloned voices | Yes | No | Yes (ElevenLabs) |
| Long-form answers | Excellent | Limited | Good |
| Turn-taking speed | Slower | Fastest | Fast |
| Status | Stable | Stable | Beta |
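The tradeoffs above can be folded into a simple decision helper. This is only a sketch of the guidance on this page, not a product API:

```python
def suggest_mode(needs_cloned_voice: bool,
                 long_form_answers: bool,
                 latency_critical: bool) -> str:
    """Suggest a voice-engine mode from the tradeoffs documented here."""
    if long_form_answers and not latency_critical:
        return "Pipeline"          # best long-form quality, slower turns
    if needs_cloned_voice:
        return "Dualplex"          # fast turns + full voice library (Beta)
    if latency_critical:
        return "Speech-to-speech"  # fastest, but limited voice set
    return "Dualplex"              # recommended default for most use cases
```

For example, a support assistant giving detailed explanations maps to `Pipeline`, while a booking assistant with a branded voice maps to `Dualplex`.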