Google DeepMind Unveils DiffusionGemma AI Model with Fourfold Speed Increase

Google DeepMind has introduced DiffusionGemma, a new model in its Gemma family of open models. Most AI models generate text one token at a time, from left to right. DiffusionGemma instead generates its output in parallel, an approach Google DeepMind likens to image generation models that denoise static. The company says this makes DiffusionGemma significantly faster and more efficient on local hardware, with up to four times the output speed of similarly sized autoregressive Gemma models. It has 26 billion parameters, of which 3.8 billion are active during inference, and is sized to run on high-end consumer GPUs.

By Fainaron · Jun 10, 2026

Google DeepMind has released DiffusionGemma, a new artificial intelligence model that employs a unique parallel processing method for text generation, setting it apart from most existing AI models.

Unlike conventional autoregressive models that generate text token by token from left to right, DiffusionGemma operates by producing an entire block of text simultaneously. This approach is similar to how image generation models create content by denoising an initial field of static. DiffusionGemma iteratively refines a canvas of placeholder tokens, using estimated tokens to improve subsequent estimations before finalizing its output.
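To make the refinement loop more concrete, here is a minimal toy sketch of masked-diffusion decoding. It assumes a confidence-based unmasking schedule, which is one common strategy in published masked-diffusion language models. The article does not say which schedule DiffusionGemma actually uses. The `Predictor` interface, `diffusion_decode`, and `toy_predict` are all hypothetical illustrations, not Google DeepMind's API. In each parallel pass the predictor guesses every position at once. The most confident guesses are committed, and those committed tokens then raise the predictor's confidence at neighboring positions in the next pass.

```python
import math
from typing import Callable, List, Optional, Tuple

MASK = None  # placeholder for a not-yet-decided token

# Hypothetical denoiser interface: given the whole canvas, return a
# (token, confidence) guess for every position in one parallel pass.
Predictor = Callable[[List[Optional[str]]], List[Tuple[str, float]]]


def diffusion_decode(predict: Predictor, block_len: int, steps: int) -> List[str]:
    """Toy masked-diffusion decoding: start from all placeholders and
    commit the most confident guesses a few positions per step."""
    canvas: List[Optional[str]] = [MASK] * block_len
    per_step = math.ceil(block_len / steps)  # positions committed per pass
    for _ in range(steps):
        masked = [i for i, tok in enumerate(canvas) if tok is MASK]
        if not masked:
            break
        guesses = predict(canvas)  # every position is predicted at once
        # Commit the highest-confidence guesses among still-masked slots;
        # committed tokens then condition the next round of predictions.
        masked.sort(key=lambda i: guesses[i][1], reverse=True)
        for i in masked[:per_step]:
            canvas[i] = guesses[i][0]
    assert all(tok is not MASK for tok in canvas)
    return [tok for tok in canvas if tok is not MASK]


# Hypothetical toy predictor: it always "knows" the target sentence, but is
# more confident at positions next to tokens that are already committed.
TARGET = "the cat sat on the mat".split()
call_log: List[int] = []


def toy_predict(canvas: List[Optional[str]]) -> List[Tuple[str, float]]:
    call_log.append(1)  # count parallel passes
    guesses = []
    for i in range(len(canvas)):
        neighbors = [j for j in (i - 1, i + 1) if 0 <= j < len(canvas)]
        committed = sum(canvas[j] is not MASK for j in neighbors)
        guesses.append((TARGET[i], 0.5 + 0.25 * committed))
    return guesses


result = diffusion_decode(toy_predict, block_len=len(TARGET), steps=3)
print(" ".join(result))  # the cat sat on the mat
print(len(call_log))     # 3
```

Here a six-token block is finished in three parallel passes rather than six sequential ones. A real model would work on token IDs and logits rather than a predictor that already knows the answer.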

This parallel generation capability enables DiffusionGemma to achieve a substantial speed increase. Google DeepMind reports that the model can produce outputs up to four times faster than autoregressive Gemma models of comparable size. The design also enhances efficiency when running on local hardware, including high-end gaming GPUs and specialized AI accelerators.
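One rough way to see where the speed-up can come from is to count sequential model calls. The sketch below assumes block-wise generation with a fixed number of refinement steps per block. `block_len=32` and `steps_per_block=16` are hypothetical values, not DiffusionGemma's settings. Pass count is only a proxy for speed. Each diffusion pass processes a whole block, so real wall-clock gains depend on how well the hardware parallelizes that work. The "up to four times" figure is Google DeepMind's reported result and is not derived here.

```python
import math


def autoregressive_passes(n_tokens: int) -> int:
    # Left-to-right decoding: one sequential forward pass per token.
    return n_tokens


def block_diffusion_passes(n_tokens: int, block_len: int, steps_per_block: int) -> int:
    # Blocks are produced one after another; within a block, every
    # refinement step updates all positions in a single parallel pass.
    return math.ceil(n_tokens / block_len) * steps_per_block


ar = autoregressive_passes(256)
diff = block_diffusion_passes(256, block_len=32, steps_per_block=16)
print(ar, diff)  # 256 128
```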

In terms of scale, DiffusionGemma is a Mixture of Experts (MoE) model with 26 billion parameters in total, of which only 3.8 billion are activated during inference. The small active share reduces the computation needed for each token, and the model reportedly fits within about 18GB of memory on a high-end GPU.
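The article does not state the numeric precision of the weights. The following back-of-the-envelope estimate therefore tries a few common bit widths. It covers weights only and ignores activations, caches, and runtime overhead. It also assumes that all experts stay resident in GPU memory, as is typical for MoE inference. Under that assumption, memory scales with the total 26 billion parameters, while the 3.8 billion active parameters mainly govern per-token compute. `weight_gb` is a hypothetical helper.

```python
TOTAL_PARAMS = 26e9   # total parameters reported for DiffusionGemma
ACTIVE_PARAMS = 3.8e9  # parameters reported as active during inference


def weight_gb(params: float, bits: int) -> float:
    # Weights-only footprint in decimal gigabytes at a given precision.
    return params * bits / 8 / 1e9


for bits in (16, 8, 4):
    print(f"{bits}-bit weights: {weight_gb(TOTAL_PARAMS, bits):.1f} GB")
print(f"active fraction: {ACTIVE_PARAMS / TOTAL_PARAMS:.1%}")
# 16-bit weights: 52.0 GB
# 8-bit weights: 26.0 GB
# 4-bit weights: 13.0 GB
# active fraction: 14.6%
```

Under these assumptions, only weights stored at roughly 4-bit precision would come in below the 18GB figure.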

The reported output rates are high: DiffusionGemma generates approximately 700 tokens per second on an Nvidia RTX 5090 GPU and more than 1,000 tokens per second on a single Nvidia H100 AI accelerator.
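Throughput figures are easier to interpret as waiting time. This short sketch converts the reported rates into seconds for a hypothetical 2,000-token response. It assumes steady-state generation and ignores prompt processing. Because the H100 figure is "over 1,000" tokens per second, its result is an upper bound. `seconds_for` is an illustrative helper.

```python
def seconds_for(n_tokens: int, tokens_per_second: float) -> float:
    # Steady-state generation time; ignores prompt processing and startup.
    return n_tokens / tokens_per_second


for device, tps in (("RTX 5090", 700), ("H100", 1000)):
    print(f"{device}: {seconds_for(2000, tps):.2f} s for 2,000 tokens")
# RTX 5090: 2.86 s for 2,000 tokens
# H100: 2.00 s for 2,000 tokens
```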

According to Ars Technica, DiffusionGemma represents a significant advancement in local AI processing speed and efficiency.

Source attribution: This article was AI-curated and rewritten by Fainaron from a piece originally published by Ars Technica.