DiffusionGemma is Google's experimental text diffusion model built on Gemma 4.
DiffusionGemma generates text by refining a block of tokens in parallel instead of writing one token at a time. This guide explains the architecture, the serving path, and the fastest way to start experimenting.
What makes DiffusionGemma different from autoregressive language models
Built on Gemma 4 MoE
DiffusionGemma uses a 26B Mixture-of-Experts Gemma 4 backbone while activating roughly 4B parameters during inference.
Open weights for developers
The model weights are available on Hugging Face, giving researchers and builders a direct path to local experiments.
Parallel block generation
Instead of loading weights for every next token, it refines a 256-token canvas in parallel and shifts more work to compute.
This page is built for fast understanding
If you want the definition, the architecture, and the shortest path to serving DiffusionGemma, the key details are all here.
The diffusion process denoises blocks of text instead of decoding strictly left to right
DiffusionGemma combines causal prefill for committed context with bidirectional denoising over the current token canvas.
Prefill the prompt context
The model ingests the prompt with causal attention and writes the prompt context into the KV cache before denoising starts.
Initialize a token canvas
The sampler starts with a 256-token canvas of placeholders, then updates all positions in parallel over multiple denoising steps.
Bidirectional denoising
During denoising, each canvas position can attend to the other positions, which enables global context propagation and self-correction.
Commit and continue
When a block stabilizes, it is committed to the KV cache and the next 256-token canvas is initialized for longer generations.
It brings diffusion-style decoding to open text generation.
DiffusionGemma changes the generation loop: it refines whole token blocks, uses bidirectional context inside each block, and can revisit uncertain positions before committing output.
DiffusionGemma: the official developer guide
Google's guide explains the block-autoregressive denoising loop, the 256-token canvas, and the deployment path for developers.
Read the guideHow to go from reading about DiffusionGemma to serving text
The simplest path is to run the Hugging Face model with vLLM, then call the OpenAI-compatible chat endpoint from your own scripts or tools.
Install dependencies
Install vLLM in a Python environment with a supported GPU stack.
Serve the model
Start a local OpenAI-compatible endpoint using the official Hugging Face model ID.
Call the chat API
Send a normal chat-completions request while the server handles block denoising internally.
Explore serving tradeoffs
Tune denoising steps, max tokens, batching, and hardware choice for latency-sensitive text workflows.
The fastest answers to the questions people ask first
Start here if you want the creator, the architecture, or the serving path without reading the full guide first.
Every claim on this page points back to Google's guide, the model card, or the serving documentation so you can verify details yourself.
Developer Guide
Google's introduction to the architecture, block denoising loop, training recipes, and developer setup.
Model Card
The Hugging Face model page for weights, license, intended use, and implementation details.
vLLM Serving
The serving stack used for an OpenAI-compatible local endpoint in the quick start.