Local TTS model

Moshi

Full-duplex spoken dialogue model - listens and speaks simultaneously with ~160 ms latency. Not just a TTS but a real-time conversational speech model. Runs on a single L4 GPU or Mac M3 Pro.

GPU recommended text-to-speech generation 1 languages CC-BY-4.0
Quality
9/10
Speed
9.5/10
Model size
7.5 GB
Voices
Full-duplex conversational

Can Moshi run locally?

Moshi can generate speech locally for private voice workflows. Start with pip install moshi.

CC-BY-4.0 license. Review upstream restrictions before commercial use.

dialoguestreamingrealtimelow-latencyemotion

Audio profile

Quality
9
Speed
9.5
Local
8.7

Best fit

Moshi is best for fast on-device voice responses and local assistants.

Hardware: gpuapple

Model details

Type
Local TTS model
Family
moshi
Latency
ultra-low
Formats
pytorchsafetensorsmlx
Languages
en
Context
Listens + speaks simultaneously, 160 ms latency

Install locally

01
Check runtimeConfirm the backend supports pytorch, safetensors, mlx on your machine.
02
Install modelUse the upstream command or repository instructions.
03
Test locallyRun a short private audio prompt before moving into production workflows.
pip install moshi

Good for

  • text-to-speech generation
  • GPU recommended local workflows
  • dialogue, streaming, realtime

Watch before shipping

  • Validate pronunciation, latency and artifacts with your own voice samples.
  • Review the upstream license and acceptable-use notes.
  • Benchmark on your target CPU, Apple Silicon or GPU setup.

Related TTS and speech models

CompareBrowse all TTS models Local AIBrowse LLM models macOS appGet LocalClaw