Full-Duplex Speech Models

Full-duplex speech models change the shape of a voice system. A cascaded assistant waits for audio, transcribes it, runs a language model, then synthesizes speech. A duplex model keeps listening while it talks, so interruption, backchanneling, and overlap become model behavior rather than separate product hacks. Moshi is the clean architectural starting point: it models user audio and model audio as parallel streams and reports about 160 ms theoretical latency and 200 ms practical latency for speech-to-speech interaction (Moshi paper).

Figure: Moshi architecture overview with parallel user audio, model audio, and inner-monologue streams.

Why Duplex Is Different

The old pipeline has three clocks: ASR, LLM, and TTS. Each component can stream, but the whole system still depends on turn segmentation. Someone has to decide when the user stopped speaking, when the model may answer, and what to do when speech overlaps.
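To make the turn-boundary dependency concrete, here is a minimal sketch of a silence-timeout endpointer, the kind of rule a cascaded pipeline might use to decide that the user has stopped speaking. The function name `segment_turns`, the per-frame boolean input, and the `hangover_frames` parameter are illustrative assumptions, not taken from any system discussed here.

```python
from typing import List, Optional, Sequence, Tuple

# (first_speech_frame, end_frame_exclusive, frame_at_which_the_end_was_decided)
Turn = Tuple[int, int, Optional[int]]


def segment_turns(frame_is_speech: Sequence[bool], hangover_frames: int = 3) -> List[Turn]:
    """Split a stream of voice-activity flags into turns with a silence timeout.

    A turn only closes after `hangover_frames` consecutive silent frames, so the
    decision always arrives later than the moment the user actually stopped.
    """
    turns: List[Turn] = []
    start: Optional[int] = None
    silence = 0
    for i, speech in enumerate(frame_is_speech):
        if speech:
            if start is None:
                start = i          # first speech frame opens a turn
            silence = 0            # any speech resets the silence counter
        elif start is not None:
            silence += 1
            if silence >= hangover_frames:
                # The turn ended where the silence began; we only know it now.
                turns.append((start, i - silence + 1, i))
                start, silence = None, 0
    if start is not None:
        # Stream ended before the timeout fired: the boundary is still undecided.
        turns.append((start, len(frame_is_speech) - silence, None))
    return turns


frames = [False, True, True, False, True, False, False, False, True]
turns = segment_turns(frames)
print(turns)  # [(1, 5, 7), (8, 9, None)]
```

In this toy run the first turn ends at frame 5, but the pipeline cannot act until frame 7. The short pause at frame 3 is absorbed into the turn, and the speech at frame 8 arrives after the boundary was already declared, which is the overlap case a cascade has to patch around.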

Moshi removes that explicit turn boundary. It predicts across parallel streams for user speech, model speech, and an "Inner Monologue" text channel that gives the model a semantic path before it emits audio tokens (Moshi paper). The model has to learn timing, not merely content.
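One way to picture the parallel streams is as a single record per time step that carries a text token plus audio tokens for both speakers. The sketch below is an illustration under stated assumptions, not Moshi's actual data layout: the class `DuplexStep`, the ordering inside `flatten`, the frame rate of 12.5 Hz, and the count of 8 codebooks per audio stream are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List

FRAME_RATE_HZ = 12.5  # assumption: codec frames per second
NUM_CODEBOOKS = 8     # assumption: audio tokens per stream per frame


@dataclass
class DuplexStep:
    """Everything the model sees or emits for one time step (hypothetical layout)."""
    text_token: int          # inner-monologue text channel
    model_audio: List[int]   # audio tokens for the model's own speech
    user_audio: List[int]    # audio tokens for the user's speech

    def flatten(self) -> List[int]:
        """Lay the three streams side by side: text first, then both audio streams."""
        if len(self.model_audio) != NUM_CODEBOOKS or len(self.user_audio) != NUM_CODEBOOKS:
            raise ValueError("each audio stream needs one token per codebook")
        return [self.text_token, *self.model_audio, *self.user_audio]


def frame_duration_ms(frame_rate_hz: float = FRAME_RATE_HZ) -> float:
    """Length of one time step in milliseconds."""
    return 1000.0 / frame_rate_hz


step = DuplexStep(text_token=7, model_audio=list(range(8)), user_audio=list(range(8, 16)))
tokens_per_step = len(step.flatten())
print(tokens_per_step, frame_duration_ms())  # 17 80.0
```

Under these assumptions each step covers 80 ms, so one frame plus one further frame of delay would account for a 160 ms floor. Because both speakers occupy every step, silence and overlap are ordinary values in the sequence rather than events handled outside the model.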

Moshi as the Baseline

Kyutai's Moshi gives the field a concrete recipe for open full-duplex speech. The paper describes a speech-text foundation model for real-time dialogue, with audio tokens generated through a Mimi codec path and semantic text tokens used as an intermediate reasoning stream (Moshi paper). Because Moshi exposes both what the model says and when it speaks, later systems can inherit the duplex substrate and focus on control, persona, or multimodal scope.

Moshi's limitations are just as instructive. It proves that simultaneous speech-to-speech generation can work, but it does not solve every product problem around role control, voice conditioning, domain knowledge, or visual context. The next wave adds those controls on top of the same duplex idea.

PersonaPlex and Prompt Control

PersonaPlex keeps Moshi's duplex foundation and targets a specific problem: how do you control both the character and the voice without breaking simultaneity? NVIDIA's description uses hybrid prompting, with a text prompt for role or scenario conditioning and a voice prompt for speaker conditioning (PersonaPlex project page, PersonaPlex paper).

The training mixture is unusually informative. PersonaPlex uses 7,303 Fisher conversations totaling 1,217 hours of real unscripted speech, plus 39,322 synthetic assistant conversations totaling 410 hours and 105,410 synthetic customer-service conversations totaling 1,840 hours (PersonaPlex project page). That blend gives the model two things at once: real overlap and backchannel patterns from human conversation, and broader task coverage from synthetic dialogue.
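The balance of that mixture is easier to see with a little arithmetic. The sketch below uses only the conversation counts and hours quoted above; the dictionary keys and the `summarize` helper are names made up for this example.

```python
# (conversations, hours) per source, as quoted from the PersonaPlex project page
MIXTURE = {
    "fisher_real": (7_303, 1_217),
    "synthetic_assistant": (39_322, 410),
    "synthetic_customer_service": (105_410, 1_840),
}

total_hours = sum(hours for _, hours in MIXTURE.values())
total_conversations = sum(count for count, _ in MIXTURE.values())


def summarize(name: str) -> dict:
    """Share of total hours and average conversation length for one source."""
    count, hours = MIXTURE[name]
    return {
        "hour_share": hours / total_hours,
        "conversation_share": count / total_conversations,
        "avg_minutes": hours * 60 / count,
    }


for name in MIXTURE:
    s = summarize(name)
    print(f"{name}: {s['hour_share']:.1%} of hours, "
          f"{s['conversation_share']:.1%} of conversations, "
          f"{s['avg_minutes']:.2f} min average")
```

The totals come to 3,467 hours across 152,035 conversations. Real Fisher speech is roughly 35% of the hours but under 5% of the conversations, because its conversations average about ten minutes while the synthetic ones average about a minute or less.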

PersonaPlex also reports the kind of latency and interruption metrics that matter for duplex systems. NVIDIA reports 90.8% smooth turn-taking, 95.0% user-interruption handling, 0.170 s average turn-taking latency, and 0.240 s interruption latency on its benchmark comparisons (PersonaPlex project page). Those numbers should not be read as a product-readiness guarantee. They do show the right evaluation target: timing and interruption behavior, not only word error rate or response quality.
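As a rough illustration of what timing-oriented evaluation involves, the sketch below computes an average response latency and a success rate from timestamped events. This is not NVIDIA's benchmark procedure: the event format, the function names, and the sample values are all hypothetical.

```python
from statistics import mean
from typing import Sequence, Tuple


def mean_response_latency(turns: Sequence[Tuple[float, float]]) -> float:
    """Average gap in seconds between the user stopping and the agent starting.

    Each item is (user_end_time, agent_start_time). A negative gap would mean
    the agent started while the user was still speaking.
    """
    return mean(agent_start - user_end for user_end, agent_start in turns)


def success_rate(outcomes: Sequence[bool]) -> float:
    """Fraction of events (turns or interruptions) judged to be handled well."""
    return sum(outcomes) / len(outcomes)


# Hypothetical measurements, in seconds
turns = [(1.00, 1.15), (4.20, 4.40), (7.00, 7.16)]
latency = mean_response_latency(turns)
rate = success_rate([True, True, True, False])
print(f"{latency:.3f} s, {rate:.0%}")
```

The point of the sketch is the unit of measurement: the inputs are timestamps and per-event judgments, not transcripts, which is why word error rate alone cannot capture duplex behavior.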

Qwen2.5-Omni and Thinker-Talker

Qwen2.5-Omni takes a different route. Its Thinker-Talker architecture separates semantic response generation from audio generation: the Thinker handles text and multimodal reasoning, while the Talker produces speech through a dual-track audio path (Qwen2.5-Omni technical report). The report also describes block-wise streaming encoders, TMRoPE for aligning audio and video timing, and a sliding-window DiT path for streaming speech decoding.

This split matters because duplex systems can block themselves. If the same pathway must reason, align vision, track audio, and synthesize speech, one stream can stall another. Qwen2.5-Omni's design isolates the semantic and acoustic jobs while keeping them synchronized. The price is a more complex scheduler and a larger set of streaming states to manage.
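The isolation argument can be sketched as a producer and a consumer joined by a bounded queue. This is a generic concurrency pattern used as a stand-in, not Qwen2.5-Omni's implementation: the functions `thinker`, `talker`, and `run_pipeline`, the string "tokens", and the `max_pending` parameter are all assumptions for illustration.

```python
import queue
import threading
from typing import List

_DONE = object()  # sentinel that tells the consumer the stream is finished


def thinker(inputs: List[str], out_q: "queue.Queue") -> None:
    """Stand-in for semantic generation: emit one unit at a time."""
    for item in inputs:
        out_q.put(item.upper())  # blocks if the talker has fallen too far behind
    out_q.put(_DONE)


def talker(in_q: "queue.Queue", audio_out: List[str]) -> None:
    """Stand-in for speech generation: turn each semantic unit into an audio chunk."""
    while True:
        item = in_q.get()
        if item is _DONE:
            break
        audio_out.append(f"<audio:{item}>")


def run_pipeline(inputs: List[str], max_pending: int = 2) -> List[str]:
    """Run both stages concurrently with at most `max_pending` units in flight."""
    q: "queue.Queue" = queue.Queue(maxsize=max_pending)
    audio: List[str] = []
    workers = [
        threading.Thread(target=thinker, args=(inputs, q)),
        threading.Thread(target=talker, args=(q, audio)),
    ]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    return audio


print(run_pipeline(["hello", "there"]))  # ['<audio:HELLO>', '<audio:THERE>']
```

The bounded queue is the extra streaming state mentioned above in miniature: it lets the two stages run at different speeds, but someone now has to choose its size and decide what happens to pending items when the user interrupts.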

MiniCPM-o and Omni-Flow

MiniCPM-o 4.5 moves the conversation from duplex speech toward omni-flow. Its paper describes visual and audio streams aligned on a shared temporal axis, multimodal encoders feeding the backbone, and interleaved speech-token plus waveform-decoder paths for concurrent speech output (MiniCPM-o 4.5 paper, MiniCPM-o 4.5 model card).

Adding video changes the failure mode. A voice-only duplex model mainly has to synchronize user speech and model speech. An omni-flow model has to keep audio, visual frames, text state, and generated speech aligned under a tight latency budget. That raises token pressure and moves temporal alignment into the core architecture.
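A small sketch shows what a shared temporal axis means in practice and why token pressure rises. It merges two timestamped streams into one ordered sequence and counts the resulting token load. The chunk periods and per-chunk token costs are invented for the example and do not come from the MiniCPM-o paper; `stream`, `shared_timeline`, and `token_load` are hypothetical helper names.

```python
import heapq
from typing import List, Tuple

Event = Tuple[int, str, int]  # (timestamp_ms, modality, index within its stream)


def stream(modality: str, period_ms: int, duration_ms: int) -> List[Event]:
    """Evenly spaced events for one modality, starting at t = 0."""
    return [(t, modality, i) for i, t in enumerate(range(0, duration_ms, period_ms))]


def shared_timeline(duration_ms: int, audio_period_ms: int, video_period_ms: int) -> List[Event]:
    """Merge audio and video events into a single time-ordered sequence."""
    audio = stream("audio", audio_period_ms, duration_ms)
    video = stream("video", video_period_ms, duration_ms)
    return list(heapq.merge(audio, video))  # both inputs are already sorted by time


def token_load(events: List[Event], tokens_per_audio_chunk: int, tokens_per_video_frame: int) -> int:
    """Total tokens the backbone must absorb for these events."""
    cost = {"audio": tokens_per_audio_chunk, "video": tokens_per_video_frame}
    return sum(cost[modality] for _, modality, _ in events)


events = shared_timeline(duration_ms=2000, audio_period_ms=500, video_period_ms=1000)
load = token_load(events, tokens_per_audio_chunk=10, tokens_per_video_frame=64)
print([(t, m) for t, m, _ in events])
print(load)  # 168
```

With these made-up costs, two video frames contribute 128 of the 168 tokens in a two-second window, even though audio events are twice as frequent. That asymmetry is the token pressure the paragraph above refers to.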

Runtime Duplex and Streaming Omni

Not every system called full-duplex implements the same mechanism. VITA gets interruption handling by running two model instances at the same time: one instance generates while another monitors for interruption, then the two swap roles when needed (VITA paper, VITA project page). That is a valid system design, but it is not Moshi's native joint modeling of user and agent audio streams.
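The role swap can be sketched as a tiny state machine. This illustrates only the behavior described above, with one instance generating and one monitoring. It is not VITA's code, and the class names and the `on_audio` method are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Instance:
    name: str
    role: str  # either "generate" or "monitor"


class DuplexRuntime:
    """Two model instances that trade roles whenever an interruption is detected."""

    def __init__(self) -> None:
        self.a = Instance("A", "generate")
        self.b = Instance("B", "monitor")

    def generator(self) -> Instance:
        """The instance currently responsible for producing speech."""
        return self.a if self.a.role == "generate" else self.b

    def on_audio(self, is_interruption: bool) -> str:
        """Process one chunk of user audio and return the name of the active generator.

        `is_interruption` stands in for the monitoring instance's judgment.
        """
        if is_interruption:
            self.a.role, self.b.role = self.b.role, self.a.role
        return self.generator().name


runtime = DuplexRuntime()
history = [runtime.on_audio(flag) for flag in (False, True, False, True)]
print(history)  # ['A', 'B', 'B', 'A']
```

Everything interesting here lives in the serving layer: the swap is an orchestration decision made between two models, whereas in a natively duplex model the equivalent behavior has to emerge from the token streams themselves.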

Baichuan-Omni is useful as the contrasting case: it streams multimodal input but is not usually described as native duplex. Its report describes streaming multimodal interaction with audio boundary prediction and image/video features flowing into the model, but the original architecture still treats endpointing and audio-to-model timing as explicit stages (Baichuan-Omni technical report, Baichuan-Omni-1.5 repo). It belongs in the landscape, but not in the same bucket as native duplex speech-to-speech.

Architecture Comparison

| System | Architectural center | What it buys | Main cost |
| --- | --- | --- | --- |
| Moshi | Parallel user/model audio streams plus Inner Monologue text | Open full-duplex speech-to-speech timing | Limited explicit persona and multimodal control |
| PersonaPlex | Moshi foundation plus hybrid text and voice prompting | Role control and voice conditioning without giving up interruption handling | Dataset construction and prompt-control complexity |
| Qwen2.5-Omni | Thinker-Talker split with streaming multimodal encoders | Separate semantic reasoning and speech generation paths | More synchronization state |
| MiniCPM-o 4.5 | Omni-flow alignment across audio and vision | Continuous see-hear-speak interaction | Higher token and alignment pressure |
| VITA | Two-model runtime duplex with interruption monitoring | Practical interruption handling without native duplex token streams | More serving orchestration |
| Baichuan-Omni | Streaming multimodal endpointing and speech interaction | Broad audio-vision-text interaction | Less clean as native full-duplex speech modeling |

The pattern is separation without reverting to a cascade. Each system keeps streaming interaction alive, but each one chooses a different internal boundary: semantic text before audio, voice prompt beside role prompt, thinker beside talker, or aligned multimodal flows.

Failure Modes

Full-duplex systems fail in ways that turn-based systems can hide. A turn-based assistant can wait, clean up the transcript, and respond after silence. A duplex model has to decide whether an incoming sound is interruption, agreement, background noise, or speech it should ignore. The Moshi paper frames this as a real-time speech-text modeling problem, not a conventional ASR plus TTS problem (Moshi paper).

Prompt control adds another failure mode. PersonaPlex separates voice conditioning from role conditioning, but the model still has to preserve both while handling live speech (PersonaPlex project page). Multimodal systems add visual timing, so the model may answer based on a stale frame or misalign speech with what the camera now sees. Qwen2.5-Omni and MiniCPM-o both spend architectural machinery on streaming alignment for that reason (Qwen2.5-Omni technical report, MiniCPM-o 4.5 paper).

My Take

The interesting frontier is not "make TTS faster." Fast TTS still waits for a turn boundary. The architecture that matters is continuous state: a model that can hear, decide, speak, and revise its timing while the world keeps moving. PersonaPlex shows that persona control can sit on top of duplex speech. Qwen2.5-Omni and MiniCPM-o show that visual context can join the loop. The fragile part is orchestration. Once speech, vision, persona, and safety all stream at once, every extra capability competes for the same latency budget.

Takeaways

Full-duplex speech models work because they train or orchestrate conversation as overlapping streams instead of forcing a clean ASR-to-LLM-to-TTS chain. Moshi supplies the native token-stream pattern. PersonaPlex adds independent role and voice control. Qwen2.5-Omni separates semantic reasoning from speech generation. MiniCPM-o extends the same idea into aligned audio-visual flow. VITA shows a runtime-duplex workaround, and Baichuan-Omni shows why streaming omni-modal interaction should not be casually equated with Moshi-style duplex. The shared trade-off is synchronization: the more the model sees and controls while speaking, the harder it becomes to keep timing, state, and behavior coherent.

References

author: Arii
tag: #speech
links: [[Csm-1b Architecture]], [[Small LLMs — Use Cases and Limits]], [[World Model V-JEPA 2]], [[Multi-Token Prediction]]