Local TTS model

Step-Audio 2 Mini

Open-source multimodal speech LLM that unifies understanding and generation in one model: ASR, TTS, voice conversion, and speech dialogue. It offers strong expressive control and paralinguistic features, and is available in Mini (8B) and Full variants.

GPU recommended | text-to-speech generation | 3 languages | Apache 2.0

Quality: 9.3/10
Speed: 7.5/10
Model size: 4.8 GB
Voices: Multi-speaker + voice conversion

Can Step-Audio 2 Mini run locally?

Step-Audio 2 Mini can generate speech locally for private voice workflows. Start with pip install step-audio.

Released under the Apache 2.0 license. Still verify the upstream usage notes before shipping.

Tags: cloning, dialogue, emotion, streaming, multilingual

Audio profile

Quality: 9.3
Speed: 7.5
Local: 8.5

Best fit

Step-Audio 2 Mini is best for local voice cloning and expressive speech generation.

Hardware: GPU, Apple Silicon

Model details

Type: Local TTS model
Family: step
Latency: low
Formats: pytorch, safetensors
Languages: en, zh, ja
Context: Unified speech LLM (ASR + TTS + dialogue)

Install locally

01 Check runtime: Confirm the backend supports pytorch and safetensors on your machine.
02 Install model: Use the upstream command or repository instructions.
03 Test locally: Run a short private audio prompt before moving into production workflows.

pip install step-audio
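
The "Check runtime" step can be sketched as a small script. This is a minimal sketch, not part of the upstream project: it assumes the Python packages behind the listed formats are named `torch` and `safetensors`, and the helper names `check_runtime` and `describe_device` are hypothetical. It only reports what is importable and which device PyTorch can see; it does not load the model.

```python
import importlib.util


def check_runtime(packages=("torch", "safetensors")):
    """Map each package name to True if it can be found on this machine."""
    return {name: importlib.util.find_spec(name) is not None for name in packages}


def describe_device():
    """Best-effort device label: 'cuda', 'mps' (Apple Silicon), or 'cpu'."""
    if importlib.util.find_spec("torch") is None:
        return "cpu"  # PyTorch is not installed, so no accelerator can be used
    try:
        import torch
    except ImportError:
        return "cpu"  # installed but broken; treat as unavailable
    if torch.cuda.is_available():
        return "cuda"
    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():
        return "mps"
    return "cpu"


if __name__ == "__main__":
    for name, found in check_runtime().items():
        print(f"{name}: {'found' if found else 'missing'}")
    print(f"device: {describe_device()}")
```

If either package is reported missing, install it before following the upstream model instructions.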

Good for

  • text-to-speech generation
  • local workflows where a GPU is available (recommended)
  • cloning, dialogue, emotion

Watch before shipping

  • Validate pronunciation, latency and artifacts with your own voice samples.
  • Review the upstream license and acceptable-use notes.
  • Benchmark on your target CPU, Apple Silicon or GPU setup.
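
One way to approach the benchmarking item above is to measure latency and the real-time factor (synthesis time divided by audio duration) on the target machine. The sketch below is illustrative only: `synthesize` is a hypothetical callable standing in for whatever function your setup exposes, and `fake_synthesize` is a placeholder that returns one second of silence so the script runs without the model.

```python
import statistics
import time


def benchmark(synthesize, text, sample_rate, runs=3):
    """Time synthesize(text) over several runs.

    `synthesize` is a hypothetical callable that returns a sequence of
    audio samples at `sample_rate` Hz; swap in your real entry point.
    """
    latencies = []
    audio_seconds = 0.0
    for _ in range(runs):
        start = time.perf_counter()
        samples = synthesize(text)
        latencies.append(time.perf_counter() - start)
        audio_seconds = len(samples) / sample_rate
    median = statistics.median(latencies)
    # A real-time factor below 1.0 means synthesis is faster than playback.
    rtf = median / audio_seconds if audio_seconds else float("inf")
    return {"median_latency_s": median, "audio_s": audio_seconds, "rtf": rtf}


def fake_synthesize(text, sample_rate=16000):
    """Placeholder synthesizer: one second of silence, regardless of text."""
    return [0.0] * sample_rate


if __name__ == "__main__":
    result = benchmark(fake_synthesize, "Hello from a local test.", sample_rate=16000)
    print(result)
```

Run it with the same prompts and voice samples you plan to ship, once per target device, and compare the medians.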
