Local TTS model
Moshi
Full-duplex spoken dialogue model - listens and speaks simultaneously with ~160 ms latency. Not just a TTS but a real-time conversational speech model. Runs on a single L4 GPU or Mac M3 Pro.
GPU recommended
text-to-speech generation
1 languages
CC-BY-4.0
Quality
9/10
Speed
9.5/10
Model size
7.5 GB
Voices
Full-duplex conversational
Can Moshi run locally?
Moshi can generate speech locally for private voice workflows. Start with pip install moshi.
CC-BY-4.0 license. Review upstream restrictions before commercial use.
pip install moshi
Upstream source
dialoguestreamingrealtimelow-latencyemotion
Audio profile
Best fit
Moshi is best for fast on-device voice responses and local assistants.
Hardware: gpuapple
Model details
Type
Local TTS model
Family
moshi
Latency
ultra-low
Formats
pytorchsafetensorsmlx
Languages
en
Context
Listens + speaks simultaneously, 160 ms latency
Install locally
01
Check runtimeConfirm the backend supports pytorch, safetensors, mlx on your machine.02
Install modelUse the upstream command or repository instructions.03
Test locallyRun a short private audio prompt before moving into production workflows.pip install moshi
Good for
- text-to-speech generation
- GPU recommended local workflows
- dialogue, streaming, realtime
Watch before shipping
- Validate pronunciation, latency and artifacts with your own voice samples.
- Review the upstream license and acceptable-use notes.
- Benchmark on your target CPU, Apple Silicon or GPU setup.
Related TTS and speech models
Sesame AI
Sesame CSM
Local TTS model · Q 9.5 · Speed 7.5
hexgrad
Kokoro TTS
Local TTS model · Q 9.2 · Speed 9.8
Alibaba Cloud (Qwen Team)
Qwen3 TTS
Local TTS model · Q 9.5 · Speed 8.5
Zyphra
Zonos v0.1
Local TTS model · Q 9.5 · Speed 8.5
Neuphonic
NeuTTS Air
Local TTS model · Q 9 · Speed 9.5
Alibaba FunAudioLLM
CosyVoice 2
Local TTS model · Q 9.3 · Speed 8.8
Microsoft Research
VibeVoice Realtime 0.5B
Local TTS model · Q 9.1 · Speed 9.2
Supertone
Supertonic 3
Local TTS model · Q 8.8 · Speed 9.8