Open-weight MoE

ZAYA1-8B

Zyphra's Apache-2.0 reasoning MoE: 8.4B total parameters with only ~760M active, 16 experts, a 131K context window, Compressed Convolutional Attention (CCA) and strong math and code benchmark results. Local use is experimental today: the model currently requires Zyphra's forks of vLLM or Transformers, and LM Studio, GGUF and MLX support is not yet verified.

32 GB power user · 24 GB RAM · BF16 (Zyphra fork) · Mathematical reasoning research
Parameters: 8.4B (760M active, MoE)
Minimum RAM: 24 GB
Model size: 17 GB
Quantization: BF16 (Zyphra fork)
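
The 17 GB model size follows from the parameter count and the BF16 format, which stores each parameter in 2 bytes. A minimal sketch of that arithmetic, assuming decimal gigabytes and counting weights only (no KV cache, activations or runtime overhead); the function and constant names are illustrative:

```python
def weight_footprint_gb(total_params: float, bytes_per_param: int = 2) -> float:
    """Approximate size of the raw weights in decimal GB (BF16 = 2 bytes each)."""
    return total_params * bytes_per_param / 1e9

TOTAL_PARAMS = 8.4e9    # total parameters, from the spec above
ACTIVE_PARAMS = 760e6   # active parameters, from the spec above

size_gb = weight_footprint_gb(TOTAL_PARAMS)
active_fraction = ACTIVE_PARAMS / TOTAL_PARAMS

print(f"BF16 weights: about {size_gb:.1f} GB")
print(f"Active share of parameters: about {active_fraction:.1%}")
```

All 8.4B parameters must be resident in memory even though only about 9% of them are active, which is why the RAM requirement tracks the total count rather than the active count.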

Can ZAYA1-8B run locally?

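Yes, experimentally. The BF16 weights take about 17 GB and need at least 24 GB of RAM, so ZAYA1-8B belongs on 32 GB machines when you want stronger quality without jumping to server hardware.

Searching for Zyphra/ZAYA1-8B in LM Studio or another GGUF-compatible runtime may not return a working build yet, because support there is unverified. For now, run the model through Zyphra's vLLM or Transformers fork.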

chat · code · reasoning · math · experimental

Install path

01. Check RAM fit: Minimum 24 GB RAM. Start with the BF16 (Zyphra fork) weights.
02. Load the model: Load Zyphra/ZAYA1-8B through Zyphra's vLLM or Transformers fork. LM Studio support is not yet verified.
03. Control locally: Use LocalClaw to manage models, agents, chat, channels and scheduled OpenClaw work.

Strengths

  • High capability per active parameter: 8.4B total with only ~760M active
  • Strong mathematics, coding and long-form reasoning benchmarks
  • 131K context window
  • Apache 2.0 license
  • Designed for test-time-compute workflows such as Markovian RSA

Limitations

  • Experimental local runtime support today
  • Currently documented with Zyphra forks of vLLM and Transformers
  • No verified LM Studio, Ollama, llama.cpp, GGUF or MLX support yet
  • BF16 weights are too heavy for a clean Mac mini M4 16 GB setup

Best use cases

  • Mathematical reasoning research
  • Coding and algorithmic problem solving
  • Reasoning benchmark experimentation
  • Server/local lab evaluation with Zyphra runtime forks
  • Future compact on-device MoE experiments once runtimes catch up

Capability profile

speed: 7
quality: 8
coding: 8
reasoning: 9

Technical notes

Developer: Zyphra
License: Apache 2.0
Context window: 131,072 tokens
Architecture: Sparse MoE with Compressed Convolutional Attention (CCA), 16 experts, top-1 MLP router and learned residual scaling
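
To illustrate what top-1 routing over 16 experts means, here is a toy sketch in plain Python. It is not Zyphra's implementation: the router below is a single linear scorer standing in for the MLP router, the experts are simple scaling functions, and CCA and learned residual scaling are left out. The hidden size, weights and all names are hypothetical.

```python
import random

NUM_EXPERTS = 16   # from the architecture note above
DIM = 4            # toy hidden size (hypothetical)

rng = random.Random(0)

# Toy router: one weight vector per expert (a linear stand-in for the MLP router).
router_w = [[rng.uniform(-1, 1) for _ in range(DIM)] for _ in range(NUM_EXPERTS)]

# Toy experts: each one just scales its input by a different constant.
expert_scale = [1.0 + 0.1 * i for i in range(NUM_EXPERTS)]

def route_top1(token):
    """Score every expert and return the index of the single best one."""
    scores = [sum(w * x for w, x in zip(ws, token)) for ws in router_w]
    return max(range(NUM_EXPERTS), key=scores.__getitem__)

def moe_forward(token):
    """Run only the selected expert; the other 15 do no work for this token."""
    expert_id = route_top1(token)
    output = [expert_scale[expert_id] * x for x in token]
    return expert_id, output

token = [0.5, -1.0, 0.25, 2.0]
expert_id, output = moe_forward(token)
print(f"token routed to expert {expert_id} of {NUM_EXPERTS}")
```

Because each token passes through one expert only, the compute per token scales with the active parameters while the memory footprint still scales with the total.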


Hardware fit is based on LocalClaw's RAM tier, model size and quantization metadata. Always leave memory headroom for your OS and runtime.
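
The headroom advice can be sketched as a simple check. This is an illustrative rule, not LocalClaw's actual fit logic: the 6 GB default headroom is a hypothetical figure, and the function name is made up for the example.

```python
def fits_in_ram(ram_gb: float, model_gb: float, headroom_gb: float = 6.0) -> bool:
    """Return True if the weights plus reserved headroom fit in system RAM."""
    return model_gb + headroom_gb <= ram_gb

MODEL_GB = 17.0  # model size listed on this page

for ram in (16, 24, 32):
    print(f"{ram} GB RAM -> fits: {fits_in_ram(ram, MODEL_GB)}")
```

Under these assumptions a 16 GB machine fails the check while 24 GB and 32 GB machines pass, which matches the 24 GB minimum and the 32 GB tier given above.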
