Google DeepMind Unveils DiffusionGemma AI Model with Fourfold Speed Increase
Google DeepMind has introduced DiffusionGemma, a new AI model within its Gemma open model family. The model generates text outputs in parallel, a departure from the linear, token-by-token approach of most AI models. This method, which Google DeepMind likens to image generation models that denoise static, reportedly makes DiffusionGemma faster and more efficient on local hardware, with up to four times the output speed of similarly sized autoregressive Gemma models. It has 26 billion parameters, of which 3.8 billion are active during inference, and is aimed at high-end consumer GPUs.

Google DeepMind has released DiffusionGemma, a new artificial intelligence model that generates text in parallel rather than one token at a time, setting it apart from most existing AI models.
Unlike conventional autoregressive models that generate text token by token from left to right, DiffusionGemma operates by producing an entire block of text simultaneously. This approach is similar to how image generation models create content by denoising an initial field of static. DiffusionGemma iteratively refines a canvas of placeholder tokens, using estimated tokens to improve subsequent estimations before finalizing its output.
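To make the idea of refining a canvas of placeholder tokens concrete, here is a minimal toy sketch in Python. It is not DiffusionGemma's algorithm or API: the `toy_predict` function, the `<mask>` placeholder, the confidence formula, and the fixed number of steps are all assumptions made for illustration, and the stand-in "model" simply reads tokens from a known target sentence. What it does show is the general pattern described above: every position is estimated in one pass, the most confident estimates are committed, and those committed tokens inform the next pass.

```python
import random

MASK = "<mask>"  # hypothetical placeholder token for an unfilled position


def toy_predict(canvas, target, rng):
    """Hypothetical stand-in for a model: propose a token and a confidence
    for every position of the canvas in a single pass."""
    proposals = []
    for i, tok in enumerate(canvas):
        if tok != MASK:
            proposals.append((tok, 1.0))  # already committed
            continue
        # Assumption: confidence rises when neighbouring tokens are known,
        # mimicking how earlier estimates improve later ones.
        neighbours = [canvas[j] for j in (i - 1, i + 1) if 0 <= j < len(canvas)]
        known = sum(1 for n in neighbours if n != MASK)
        confidence = 0.3 + 0.3 * known + 0.1 * rng.random()
        proposals.append((target[i], confidence))
    return proposals


def generate(target, steps=4, seed=0):
    """Fill a fully masked canvas in `steps` parallel refinement passes."""
    rng = random.Random(seed)
    canvas = [MASK] * len(target)
    per_step = -(-len(target) // steps)  # ceiling division
    history = []
    for _ in range(steps):
        masked = [i for i, t in enumerate(canvas) if t == MASK]
        if not masked:
            break
        proposals = toy_predict(canvas, target, rng)
        # Commit only the most confident proposals; the rest stay masked.
        masked.sort(key=lambda i: proposals[i][1], reverse=True)
        for i in masked[:per_step]:
            canvas[i] = proposals[i][0]
        history.append(list(canvas))
    return canvas, history


if __name__ == "__main__":
    sentence = "the quick brown fox jumps over the lazy".split()
    final, passes = generate(sentence, steps=4)
    for n, snapshot in enumerate(passes, start=1):
        print(n, " ".join(snapshot))
```

In this toy, an eight-token block is completed in four passes, whereas a token-by-token generator would need eight sequential steps. That gap in sequential steps is the intuition behind the speed claims, although real throughput depends on the cost of each pass.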
This parallel generation capability enables DiffusionGemma to achieve a substantial speed increase. Google DeepMind reports that the model can produce outputs up to four times faster than autoregressive Gemma models of comparable size. The design also enhances efficiency when running on local hardware, including high-end gaming GPUs and specialized AI accelerators.
In terms of scale, DiffusionGemma is a Mixture of Experts (MoE) model with 26 billion parameters in total, of which only 3.8 billion are activated during inference. It is reported to fit within 18GB of memory, a footprint within reach of a high-end GPU.
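A quick back-of-the-envelope calculation puts these figures in context. The sketch below uses only the numbers quoted above; the assumptions are that 1GB means 10^9 bytes and that the full 18GB is spent on storing all 26 billion weights, which ignores activations and other runtime overhead. The variable names are illustrative.

```python
TOTAL_PARAMS = 26e9      # total parameters, as quoted
ACTIVE_PARAMS = 3.8e9    # parameters active during inference, as quoted
MEMORY_BYTES = 18e9      # 18GB, assuming 1GB = 10**9 bytes

# Share of the model that does work for any single inference step.
active_fraction = ACTIVE_PARAMS / TOTAL_PARAMS

# Average storage per weight if all parameters must fit in that memory.
bits_per_param = MEMORY_BYTES * 8 / TOTAL_PARAMS

print(f"active fraction: {active_fraction:.1%}")
print(f"implied bits per parameter: {bits_per_param:.2f}")
```

Under those assumptions, roughly one parameter in seven is active at a time, and the memory figure works out to well under 16 bits per weight on average. The active-parameter count mainly affects compute per step, while the memory footprint depends on the total parameter count and how compactly the weights are stored.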
Reported output rates are high: DiffusionGemma can generate approximately 700 tokens per second on an Nvidia RTX 5090 GPU and more than 1,000 tokens per second on a single Nvidia H100 AI accelerator.
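To translate those rates into waiting time, the short sketch below converts tokens per second into seconds per response. The 2,000-token response length is an arbitrary example, not a figure from the report, and the H100 value is treated as exactly 1,000 tokens per second even though the reported rate is higher, so its result is an upper bound on the time.

```python
# Reported rates in tokens per second (the H100 figure is a lower bound).
REPORTED_TOKENS_PER_SECOND = {"RTX 5090": 700, "H100": 1000}


def seconds_to_generate(num_tokens, tokens_per_second):
    """Time to emit num_tokens at a steady tokens_per_second rate."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return num_tokens / tokens_per_second


if __name__ == "__main__":
    response_length = 2000  # hypothetical response size
    for device, rate in REPORTED_TOKENS_PER_SECOND.items():
        seconds = seconds_to_generate(response_length, rate)
        print(f"{device}: {seconds:.2f} s for {response_length} tokens")
```

At these rates a response of that size would take about three seconds or less, assuming the throughput holds steady for the whole generation.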
According to Ars Technica, DiffusionGemma represents a significant advancement in local AI processing speed and efficiency.

