Diffusion

Diffusion models generate by learning how to undo corruption. Training starts with real data, adds noise over many timesteps, and asks a neural network to predict how to move a noisy sample back toward the data distribution. Sampling starts from noise and runs the learned denoiser step by step. That makes diffusion feel different from an autoregressive LLM: the model revises an entire image or latent at each step instead of committing one token at a time (Deep Unsupervised Learning using Nonequilibrium Thermodynamics, DDPM).

Figure: Latent diffusion architecture, showing the pixel-space autoencoder, latent denoising U-Net, cross-attention, and conditioning inputs.

The Forward Process

The forward process is fixed, not learned. A schedule adds Gaussian noise to a data point x_0 until late timesteps look close to pure noise. DDPM defines a Markov chain with small noise increments and trains the reverse process to denoise one step at a time (DDPM).

data image or latent x_0
  -> add a little noise
  -> add more noise
  -> ...
  -> near-Gaussian noise x_T
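
Because each step adds Gaussian noise, the chain above can be collapsed into a single draw from q(x_t | x_0). The sketch below illustrates that shortcut in NumPy. The linear schedule values (1e-4 to 0.02 over 1000 steps) follow the DDPM paper, while the function names and the 8x8 toy array are assumptions made only for illustration.

```python
import numpy as np

def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Per-step noise variances beta_t (linear schedule, DDPM-style values)."""
    return np.linspace(beta_start, beta_end, num_steps)

def q_sample(x0, t, alpha_bar, rng):
    """Draw x_t ~ q(x_t | x_0) in one shot instead of looping over t steps."""
    noise = rng.standard_normal(x0.shape)
    x_t = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * noise
    return x_t, noise

betas = linear_beta_schedule(1000)
alpha_bar = np.cumprod(1.0 - betas)       # fraction of signal variance left at step t

rng = np.random.default_rng(0)
x0 = rng.uniform(-1.0, 1.0, size=(8, 8))  # toy stand-in for an image or latent
x_early, _ = q_sample(x0, 10, alpha_bar, rng)    # still close to x0
x_late, _ = q_sample(x0, 999, alpha_bar, rng)    # almost pure Gaussian noise
```

At t = 10 the sample is still dominated by x_0, while at t = 999 less than 0.1% of the signal variance remains.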

The noise schedule controls how hard each denoising problem becomes. If early steps destroy detail too quickly, the model receives weak training signals for fine structure. If late steps leave too much structure, sampling starts from a prior that does not match the training assumption. Improved DDPM showed that a cosine noise schedule improves log-likelihood, and that learning the reverse-process variances lets sampling use far fewer steps with little loss in sample quality (Improved DDPM).
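
To make the schedule's effect concrete, the sketch below compares how much signal survives halfway through the chain under a linear schedule and under the cosine schedule from Improved DDPM. The offset s = 0.008 follows that paper; the helper names are assumptions, and the paper's clipping of per-step betas is omitted for brevity.

```python
import numpy as np

def linear_alpha_bar(num_steps, beta_start=1e-4, beta_end=0.02):
    """Cumulative signal fraction under a linear beta schedule."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    return np.cumprod(1.0 - betas)

def cosine_alpha_bar(num_steps, s=0.008):
    """Cumulative signal fraction under the cosine schedule (no beta clipping)."""
    t = np.arange(num_steps + 1) / num_steps
    f = np.cos((t + s) / (1.0 + s) * np.pi / 2.0) ** 2
    return (f / f[0])[1:]

T = 1000
lin_ab = linear_alpha_bar(T)
cos_ab = cosine_alpha_bar(T)
mid = T // 2

# The linear schedule has already removed most of the signal at the midpoint,
# while the cosine schedule still keeps roughly half of it.
print(f"linear alpha_bar at midpoint: {lin_ab[mid]:.3f}")
print(f"cosine alpha_bar at midpoint: {cos_ab[mid]:.3f}")
```

A slower decay in the middle of the chain means more timesteps present the network with a non-trivial denoising problem.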

The Reverse Model

The learned network receives a noisy sample x_t, a timestep t, and optional conditioning. It predicts a quantity that lets the sampler estimate a cleaner state. DDPM commonly trains the network to predict the noise that was added, which connects the denoising objective to score matching (DDPM, Score-Based SDEs).

x_t + timestep + condition
  -> denoiser
  -> predicted noise or score
  -> sampler update
  -> x_{t-1}
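
The loop above can be sketched as a noise-prediction loss plus one ancestral sampling update. This is a minimal illustration rather than a reference implementation: the true noise stands in for a trained network, the variance choice sigma_t^2 = beta_t is one of the options discussed in DDPM, and all function names are hypothetical.

```python
import numpy as np

def noise_prediction_loss(eps_true, eps_hat):
    """Simple MSE between the added noise and the network's prediction."""
    return float(np.mean((eps_true - eps_hat) ** 2))

def predict_x0(x_t, eps_hat, alpha_bar_t):
    """Estimate of the clean sample implied by a noise prediction."""
    return (x_t - np.sqrt(1.0 - alpha_bar_t) * eps_hat) / np.sqrt(alpha_bar_t)

def ddpm_step(x_t, eps_hat, t, betas, alpha_bar, rng):
    """One stochastic reverse step x_t -> x_{t-1} with sigma_t^2 = beta_t."""
    alpha_t = 1.0 - betas[t]
    mean = (x_t - betas[t] / np.sqrt(1.0 - alpha_bar[t]) * eps_hat) / np.sqrt(alpha_t)
    if t == 0:
        return mean                       # no noise is added on the final step
    return mean + np.sqrt(betas[t]) * rng.standard_normal(x_t.shape)

betas = np.linspace(1e-4, 0.02, 1000)
alpha_bar = np.cumprod(1.0 - betas)
rng = np.random.default_rng(0)

x0 = rng.uniform(-1.0, 1.0, size=(4, 4))
t = 300
eps = rng.standard_normal(x0.shape)
x_t = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps

eps_hat = eps                             # assumption: a perfect denoiser
x0_hat = predict_x0(x_t, eps_hat, alpha_bar[t])
x_prev = ddpm_step(x_t, eps_hat, t, betas, alpha_bar, rng)
```

With a perfect noise prediction, x0_hat recovers x_0 exactly; a real network only approximates it, which is why sampling takes many small steps.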

Score-based generative modeling gives the continuous-time view. The forward process transforms data into a simple prior through a stochastic differential equation, and the reverse-time process uses a learned score field to move samples back toward data (Score-Based SDEs). In engineering terms, the denoiser is the model, and the sampler is the numerical procedure that decides how to use the model's prediction.
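
The link between noise prediction and the score can be checked numerically. For a single data point, the score of the Gaussian q(x_t | x_0) equals the negated noise divided by its standard deviation. The sketch below assumes that conditional case only (a trained network approximates the score of the marginal distribution), and the function names are illustrative.

```python
import numpy as np

def score_from_eps(eps_hat, alpha_bar_t):
    """Convert a noise prediction into a score estimate."""
    return -eps_hat / np.sqrt(1.0 - alpha_bar_t)

def conditional_gaussian_score(x_t, x0, alpha_bar_t):
    """Gradient of log N(x_t; sqrt(alpha_bar_t) * x0, (1 - alpha_bar_t) * I) w.r.t. x_t."""
    return -(x_t - np.sqrt(alpha_bar_t) * x0) / (1.0 - alpha_bar_t)

rng = np.random.default_rng(0)
alpha_bar_t = 0.3                          # arbitrary noise level for the check
x0 = rng.uniform(-1.0, 1.0, size=(5,))
eps = rng.standard_normal(x0.shape)
x_t = np.sqrt(alpha_bar_t) * x0 + np.sqrt(1.0 - alpha_bar_t) * eps

score_a = score_from_eps(eps, alpha_bar_t)
score_b = conditional_gaussian_score(x_t, x0, alpha_bar_t)
```

Both expressions give the same vector, which is why a noise-prediction network can be plugged into a score-based sampler after rescaling.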

U-Net as Denoiser

Image diffusion models often use a U-Net-like denoiser. U-Net was introduced for biomedical image segmentation with a contracting path that captures context and an expanding path that restores spatial resolution through skip connections (U-Net). Diffusion models reuse that shape because denoising needs both global structure and local detail.

The down path sees larger receptive fields and builds coarse structure. The up path recovers spatial precision. Skip connections carry high-resolution features around the bottleneck, which helps the model repair texture and edges while still using broad context. Diffusion models then add timestep embeddings, normalization, attention blocks, and conditioning paths on top of the image-to-image backbone (DDPM, Diffusion Models Beat GANs).
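
The data flow of a skip connection can be shown with array shapes alone. The sketch below is a one-level toy with no learned weights: average pooling stands in for the down path, nearest-neighbor upsampling for the up path, and channel concatenation for the skip. Real denoisers place convolution, attention, and timestep-conditioned blocks at each stage; the function names here are assumptions.

```python
import numpy as np

def downsample(x):
    """2x2 average pooling on a (C, H, W) array."""
    c, h, w = x.shape
    return x.reshape(c, h // 2, 2, w // 2, 2).mean(axis=(2, 4))

def upsample(x):
    """Nearest-neighbor 2x upsampling on a (C, H, W) array."""
    return x.repeat(2, axis=1).repeat(2, axis=2)

def toy_unet_pass(x):
    """Shape-only U-Net pass: down, up, then concatenate the skip features."""
    skip = x                       # high-resolution features kept for the up path
    bottleneck = downsample(x)     # coarse view with a wider effective context
    restored = upsample(bottleneck)
    return np.concatenate([restored, skip], axis=0)

rng = np.random.default_rng(0)
x = rng.standard_normal((3, 8, 8))
merged = toy_unet_pass(x)          # (6, 8, 8): coarse context plus original detail
```

The concatenated output carries both the blurred, context-level view and the untouched high-resolution features, which is the property the denoiser relies on.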

Latent Diffusion

Pixel-space diffusion pays for every denoising step at image resolution. Latent Diffusion Models move the denoising process into the latent space of a pretrained autoencoder, then decode the final latent back to pixels. The LDM paper argues that this reduces computation while preserving enough detail for high-resolution synthesis (Latent Diffusion Models).

image
  -> encoder compresses to latent
  -> diffusion denoises latent
  -> decoder reconstructs image
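
A quick back-of-the-envelope calculation shows why the latent loop is cheaper. The numbers below are assumptions chosen for illustration: a 512x512 RGB image, a spatial downsampling factor of 8, and 4 latent channels. Other autoencoder configurations change the ratio.

```python
def latent_shape(height, width, factor=8, latent_channels=4):
    """Shape of the latent for an image of the given size (assumed settings)."""
    return (latent_channels, height // factor, width // factor)

def num_elements(shape):
    """Total number of values in an array of the given shape."""
    total = 1
    for dim in shape:
        total *= dim
    return total

pixel_shape = (3, 512, 512)
lat_shape = latent_shape(512, 512)            # (4, 64, 64)
ratio = num_elements(pixel_shape) / num_elements(lat_shape)
print(ratio)  # 48.0
```

Under these assumptions every denoising step touches 48 times fewer values than it would in pixel space, while the encoder and decoder each run only once per image.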

Latent diffusion also gives text-to-image systems a clean conditioning boundary. A text encoder turns the prompt into vectors. Cross-attention layers let the denoising network read those vectors while it updates the image latent. The LDM paper introduced cross-attention conditioning for inputs such as text, bounding boxes, and semantic maps (Latent Diffusion Models).
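
The cross-attention read can be sketched in a few lines of NumPy. Queries come from the image latent tokens, while keys and values come from the text encoder output. This is a single-head sketch with random projection matrices standing in for learned weights; the dimensions and names are assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(latent_tokens, text_tokens, w_q, w_k, w_v):
    """Each latent token attends over all text tokens (single head)."""
    q = latent_tokens @ w_q                    # (N, d)
    k = text_tokens @ w_k                      # (M, d)
    v = text_tokens @ w_v                      # (M, d)
    weights = softmax(q @ k.T / np.sqrt(q.shape[-1]))   # (N, M)
    return weights @ v, weights

rng = np.random.default_rng(0)
latent_tokens = rng.standard_normal((16, 8))   # 16 spatial positions, width 8
text_tokens = rng.standard_normal((5, 12))     # 5 prompt tokens, width 12
w_q = rng.standard_normal((8, 8))
w_k = rng.standard_normal((12, 8))
w_v = rng.standard_normal((12, 8))

out, weights = cross_attention(latent_tokens, text_tokens, w_q, w_k, w_v)
```

Each row of weights is a distribution over prompt tokens, so every spatial position decides which parts of the prompt to read at that denoising step.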

Conditioning and Guidance

Conditioning tells the denoiser which region of the data distribution to sample from. Text-to-image systems condition on language embeddings. Class-conditional image models condition on labels. Inpainting systems condition on known pixels and a mask. The architecture changes with the condition, but the core loop stays the same: predict a denoising direction for the current noisy state (Latent Diffusion Models, Imagen).

Classifier-free guidance trains one model to handle conditional and unconditional denoising, then combines the two predictions at sampling time. Higher guidance pushes samples toward the condition, but it can reduce diversity and amplify artifacts when the scale is too aggressive (Classifier-Free Diffusion Guidance).
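
The combination step is a one-line extrapolation. The sketch below uses the common convention in which a scale of 1 returns the conditional prediction and a scale of 0 returns the unconditional one; the paper writes the same idea with a weight w, where the scale here corresponds to 1 + w. The condition-dropping helper and its p_uncond parameter are hypothetical names for the training-time trick of sometimes replacing the condition with a null input.

```python
import numpy as np

def drop_condition(cond, null_cond, p_uncond, rng):
    """Training-time helper: replace the condition with a null input at random."""
    return null_cond if rng.random() < p_uncond else cond

def classifier_free_guidance(eps_cond, eps_uncond, scale):
    """Extrapolate from the unconditional prediction toward the conditional one."""
    return eps_uncond + scale * (eps_cond - eps_uncond)

eps_uncond = np.array([0.0, 1.0])
eps_cond = np.array([1.0, 1.0])

guided = classifier_free_guidance(eps_cond, eps_uncond, scale=7.5)
print(guided)  # [7.5 1. ]
```

Only the component where the two predictions disagree is amplified, which is why large scales sharpen prompt adherence and also exaggerate artifacts.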

Samplers

Sample quality and cost depend on the sampler as well as the denoiser. DDPM sampling follows a stochastic reverse Markov chain and typically needs many network evaluations, one per timestep. DDIM keeps the same training objective but defines a non-Markovian process whose sampler can skip timesteps and run much faster. Setting its noise parameter (eta in the paper) to zero makes sampling deterministic (DDIM).
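
A DDIM update can be sketched as: estimate the clean sample, then re-noise it to the next, lower noise level. The example below runs 50 strided steps over a 1000-step linear schedule. It assumes a perfect noise prediction in place of a trained network, so the loop recovers x_0 exactly; the function and variable names are illustrative.

```python
import numpy as np

def ddim_step(x_t, eps_hat, ab_t, ab_prev, eta=0.0, rng=None):
    """One DDIM update between noise levels ab_t and ab_prev (alpha_bar values)."""
    x0_hat = (x_t - np.sqrt(1.0 - ab_t) * eps_hat) / np.sqrt(ab_t)
    sigma = eta * np.sqrt((1.0 - ab_prev) / (1.0 - ab_t)) * np.sqrt(1.0 - ab_t / ab_prev)
    direction = np.sqrt(np.maximum(1.0 - ab_prev - sigma ** 2, 0.0)) * eps_hat
    # eta = 0 adds no fresh noise, so the update is deterministic.
    noise = 0.0 if eta == 0.0 else sigma * rng.standard_normal(x_t.shape)
    return np.sqrt(ab_prev) * x0_hat + direction + noise

betas = np.linspace(1e-4, 0.02, 1000)
alpha_bar = np.cumprod(1.0 - betas)
steps = np.arange(999, -1, -20)            # 50 timesteps: 999, 979, ..., 19

rng = np.random.default_rng(0)
x0 = rng.uniform(-1.0, 1.0, size=(4, 4))
eps = rng.standard_normal(x0.shape)
x = np.sqrt(alpha_bar[999]) * x0 + np.sqrt(1.0 - alpha_bar[999]) * eps

for i, t in enumerate(steps):
    # After the last step the target noise level is zero (alpha_bar = 1).
    ab_prev = alpha_bar[steps[i + 1]] if i + 1 < len(steps) else 1.0
    x = ddim_step(x, eps, alpha_bar[t], ab_prev)   # eps stands in for the network
```

With a real network the noise prediction changes at every step, so using fewer steps trades accuracy for speed.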

Karras and colleagues framed diffusion as a design space with separable choices: noise parameterization, network preconditioning, training distribution over noise levels, and sampling solver. Their paper shows why sampler design should not be treated as an afterthought bolted onto a fixed model (Elucidating the Design Space of Diffusion-Based Generative Models).

Diffusion and Transformers

Diffusion is a generation process, not a single backbone. Many image diffusion systems use U-Nets, but diffusion transformers replace the U-Net denoiser with a transformer operating over latent patches. DiT reports that transformer denoisers with more forward-pass compute, whether from deeper or wider models or from more input tokens, reach lower FID on class-conditional ImageNet, showing that diffusion can inherit transformer scaling behavior when the data representation fits token-like patches (DiT).
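
The step that makes a latent "token-like" is patchification: the spatial grid is cut into small patches, and each patch is flattened into one token. The sketch below shows that reshaping and its inverse in NumPy. The 4x32x32 latent and patch size 2 are illustrative assumptions, and the linear embedding and positional encoding that follow in a real model are omitted.

```python
import numpy as np

def patchify(latent, patch):
    """Turn a (C, H, W) latent into (num_patches, patch * patch * C) tokens."""
    c, h, w = latent.shape
    x = latent.reshape(c, h // patch, patch, w // patch, patch)
    x = x.transpose(1, 3, 2, 4, 0)             # (H/p, W/p, p, p, C)
    return x.reshape((h // patch) * (w // patch), patch * patch * c)

def unpatchify(tokens, c, h, w, patch):
    """Inverse of patchify: rebuild the (C, H, W) latent from tokens."""
    x = tokens.reshape(h // patch, w // patch, patch, patch, c)
    x = x.transpose(4, 0, 2, 1, 3)             # (C, H/p, p, W/p, p)
    return x.reshape(c, h, w)

rng = np.random.default_rng(0)
latent = rng.standard_normal((4, 32, 32))
tokens = patchify(latent, patch=2)             # 256 tokens of width 16
restored = unpatchify(tokens, 4, 32, 32, patch=2)
```

Halving the patch size quadruples the token count, which is one way to spend more compute per forward pass without changing the model's width or depth.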

This is the useful distinction: a transformer is an architecture for mixing token or patch representations; diffusion is a training and sampling framework for reversing corruption. They can combine. A text-to-image model may use a transformer text encoder, cross-attention inside a U-Net, or a full transformer denoiser.

Diffusion vs LLM Generation

A decoder-only LLM samples a token, appends it to the prefix, and repeats. Its causal mask lets training predict every position in parallel without seeing future tokens, but generation still commits to a growing sequence one token at a time. Diffusion starts from noise and repeatedly updates the whole sample or latent. It can revise global structure across denoising steps, but each step requires another denoiser call (Attention Is All You Need, DDPM).

Property          | Autoregressive LLM                                 | Diffusion model
------------------|----------------------------------------------------|------------------------------------------
State             | Token prefix                                       | Noisy sample or latent
Basic step        | Predict next token                                 | Predict denoising direction
Dependency        | Left-to-right causal sequence                      | Iterative refinement over the full sample
Main serving cost | One forward pass per generated token plus KV cache | Many denoiser evaluations per sample
Natural fit       | Text and discrete sequences                        | Images, audio, video, continuous latents

The comparison is not a winner-take-all choice. Text can be modeled with diffusion-like objectives, and images can be modeled autoregressively. The default pairings exist because token prefixes suit language and spatial denoising suits image synthesis.

Failure Modes

Diffusion systems fail when the conditioning path and denoising path disagree. A text encoder may represent the prompt poorly. Cross-attention may bind attributes to the wrong object. A high guidance scale may make the sample obey the prompt at the cost of texture, diversity, or natural composition. Fast samplers may skip steps that the denoiser needed for fine corrections (Classifier-Free Diffusion Guidance, DDIM, Elucidating the Design Space of Diffusion-Based Generative Models).

My take: diffusion is strong because it separates representation, denoising, conditioning, and sampling. That modularity also creates product risk. If a team tunes only prompts while ignoring the text encoder, guidance scale, latent resolution, sampler, and safety filters, it will mistake pipeline behavior for model intelligence.

Takeaways

Diffusion models learn a reverse path from noise to data. U-Nets give image diffusion a denoising backbone with local detail and global context. Latent diffusion moves the expensive loop into compressed space. Cross-attention gives text a route into the denoiser. Samplers decide how much compute the model spends at inference time. Compared with LLMs, diffusion trades left-to-right commitment for iterative global refinement, and it pays for that flexibility with repeated network evaluations.

References

  • Sohl-Dickstein et al., "Deep Unsupervised Learning using Nonequilibrium Thermodynamics"
  • Ho, Jain, and Abbeel, "Denoising Diffusion Probabilistic Models"
  • Nichol and Dhariwal, "Improved Denoising Diffusion Probabilistic Models"
  • Song et al., "Score-Based Generative Modeling through Stochastic Differential Equations"
  • Ronneberger, Fischer, and Brox, "U-Net: Convolutional Networks for Biomedical Image Segmentation"
  • Dhariwal and Nichol, "Diffusion Models Beat GANs on Image Synthesis"
  • Rombach et al., "High-Resolution Image Synthesis with Latent Diffusion Models"
  • Ho and Salimans, "Classifier-Free Diffusion Guidance"
  • Song, Meng, and Ermon, "Denoising Diffusion Implicit Models"
  • Karras et al., "Elucidating the Design Space of Diffusion-Based Generative Models"
  • Saharia et al., "Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding"
  • Peebles and Xie, "Scalable Diffusion Models with Transformers"
  • Vaswani et al., "Attention Is All You Need"
  • Wikimedia Commons, "Diffusion Architecture.png"

author: Arii
tag: #ai, #architecture
links: [[Transformers]], [[World Model V-JEPA 2]], [[Full-Duplex Speech Models]]