Google DeepMind Releases DiffusionGemma for Fast Local AI
What happened
Google DeepMind released DiffusionGemma, an experimental 26B-parameter mixture-of-experts (MoE) model (3.8B parameters active per token) with a diffusion head that generates text in parallel blocks rather than decoding autoregressively, one token at a time. At the same time, NVIDIA optimised it for local inference on RTX GPUs and DGX Spark.
Context and impact
Diffusion-based text generation has long been studied as an alternative to autoregressive decoding (for example Inception Labs' Mercury and Stanford's SEDD), but until now it has lacked a widely available production-grade implementation. DiffusionGemma is the first mainstream MoE model with this architecture to ship as open weights. For local AI it could be a breakthrough: token-by-token decoding re-reads the model's active weights for every generated token, which makes it memory-bandwidth-bound, whereas parallel block decoding spreads each weight read across many tokens. That can make it significantly faster, especially on consumer GPUs with limited memory bandwidth.
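To make the speed argument concrete, here is a minimal Python sketch that counts forward passes for the two decoding styles. It is a back-of-the-envelope illustration, not DiffusionGemma's actual algorithm: the function names, the block size of 32 and the 8 denoising steps per block are all hypothetical values chosen for the example, since the release does not state them here.

```python
import math


def autoregressive_passes(num_tokens: int) -> int:
    """Token-by-token decoding: one forward pass per generated token."""
    return num_tokens


def block_diffusion_passes(num_tokens: int, block_size: int, denoise_steps: int) -> int:
    """Block decoding: each block of tokens is refined over a fixed number
    of denoising steps, and every step is one forward pass over the block.

    block_size and denoise_steps are hypothetical knobs for illustration.
    """
    num_blocks = math.ceil(num_tokens / block_size)
    return num_blocks * denoise_steps


def pass_ratio(num_tokens: int, block_size: int, denoise_steps: int) -> float:
    """How many times fewer forward passes block decoding needs."""
    return autoregressive_passes(num_tokens) / block_diffusion_passes(
        num_tokens, block_size, denoise_steps
    )


# Hypothetical example: 256 tokens, blocks of 32, 8 denoising steps per block.
print(block_diffusion_passes(256, 32, 8))  # 64
print(pass_ratio(256, 32, 8))              # 4.0
```

The ratio only translates into wall-clock speed while decoding stays memory-bandwidth-bound, so that reading the weights dominates each pass. Each block pass also does more compute than a single-token pass, and in an MoE model a block of tokens may route to more experts than one token would, so the real gain is likely smaller than this count suggests.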
Details
- Architecture: 26B parameters, 3.8B active (MoE)
- Decoding: generates text in parallel blocks via a diffusion head
- Optimisation: tuned by NVIDIA for RTX GPUs and the DGX Spark workstation
- Classification: experimental — Google positions it as a research preview
- Part of the open-weight Gemma 3 family
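
As a rough guide to what the parameter counts above mean for local hardware, the following Python sketch estimates the weight footprint at a few precisions and a bandwidth-bound ceiling on token-by-token decoding. The 26B and 3.8B figures come from the list above. The bit widths and the 1000 GB/s bandwidth are assumptions for illustration, and the helper names are hypothetical. The estimate ignores activations, the KV cache and runtime overhead.

```python
TOTAL_PARAMS_B = 26.0   # total parameters, in billions (from the release)
ACTIVE_PARAMS_B = 3.8   # parameters active per token, in billions (from the release)


def weight_bytes(params_billion: float, bits_per_param: int) -> float:
    """Bytes needed to store the given number of parameters."""
    return params_billion * 1e9 * bits_per_param / 8


def bandwidth_bound_tokens_per_s(
    bandwidth_gb_s: float, active_params_billion: float, bits_per_param: int
) -> float:
    """Upper bound on token-by-token decode speed if every token requires
    reading all active weights once from memory (a simplifying assumption)."""
    return bandwidth_gb_s * 1e9 / weight_bytes(active_params_billion, bits_per_param)


# Hypothetical precisions; the release does not say which formats are shipped.
for bits in (16, 8, 4):
    total_gb = weight_bytes(TOTAL_PARAMS_B, bits) / 1e9
    active_gb = weight_bytes(ACTIVE_PARAMS_B, bits) / 1e9
    print(f"{bits:>2}-bit: all weights ~{total_gb:.1f} GB, active per token ~{active_gb:.1f} GB")

# Hypothetical GPU with 1000 GB/s of memory bandwidth, 4-bit weights.
ceiling = bandwidth_bound_tokens_per_s(1000, ACTIVE_PARAMS_B, 4)
print(f"bandwidth-bound ceiling: ~{ceiling:.0f} tokens/s")
```

The sketch shows why the MoE split matters locally. All 26B parameters must fit in memory (about 52 GB at 16 bits, about 13 GB at 4 bits), but only the active 3.8B need to be read per token, which is what bounds decode speed on bandwidth-limited GPUs.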