🤖 Machine Learning · Generative AI
📅 March 2026 ⏱ ≈ 11 min read 🔴 Advanced

How AI Generates Images — Diffusion Models Explained

Type "an astronaut riding a horse, oil painting, golden hour" and receive a detailed, matching image in seconds. Stable Diffusion, DALL-E, and Midjourney all use the same underlying idea: learn to reverse a process that gradually destroys an image with noise.

1. The Big Picture

The key insight from Ho et al. (DDPM, 2020): instead of training a network to generate images in one shot, train it to remove a small amount of Gaussian noise from a slightly-noisy image. Repeat this ~1000 times, starting from pure noise.

This turns a hard problem (generate realistic images) into a curriculum of easy problems (remove noise). In practice, diffusion models can match or surpass GANs in sharpness and diversity, without the instability of adversarial training.

[Figure: the forward process. Clean image → t=250 (slight noise) → t=500 (half noise) → t=750 (mostly noise) → t=1000 (pure noise)]

Generation runs these steps in reverse: start from pure Gaussian noise and iteratively denoise into a coherent image.

2. Forward Diffusion — Adding Noise

At each timestep t, a small amount of Gaussian noise is added according to a variance schedule β_t (typically 0.0001 → 0.02 over T=1000 steps):

Forward process (Markov chain):

q(x_t | x_{t-1}) = 𝒩(x_t ; √(1−β_t)·x_{t-1} , β_t·I)

Because a chain of Gaussian steps is itself Gaussian, we can sample x_t directly from x_0 at any step. Written in reparametrised form:

x_t = √ᾱ_t · x_0 + √(1−ᾱ_t) · ε , where ε ~ 𝒩(0, I)
α_t = 1 − β_t , ᾱ_t = ∏_{s=1}^{t} α_s

This closed-form expression is critical: during training we can jump directly to any noise level without running all t steps.
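
The closed-form jump can be sketched in a few lines of NumPy. This is a minimal illustration, assuming a linear β schedule between the two endpoints quoted above; the helper name `q_sample` and the toy 3×8×8 "image" are placeholders, not part of any particular library.

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)   # assumed linear schedule beta_1..beta_T
alphas = 1.0 - betas                 # alpha_t = 1 - beta_t
alpha_bars = np.cumprod(alphas)      # alpha_bar_t = product of alpha_s for s <= t

def q_sample(x0, t, rng):
    """Draw x_t ~ q(x_t | x_0) in one shot; t is 1-indexed as in the text."""
    eps = rng.standard_normal(x0.shape)
    a_bar = alpha_bars[t - 1]
    x_t = np.sqrt(a_bar) * x0 + np.sqrt(1.0 - a_bar) * eps
    return x_t, eps

rng = np.random.default_rng(0)
x0 = rng.uniform(-1.0, 1.0, size=(3, 8, 8))   # toy stand-in for a clean image
x_t, eps = q_sample(x0, 500, rng)             # jump straight to t = 500
```

With this schedule `alpha_bars[-1]` is close to zero, so x_T is almost entirely noise, which is why generation can start from a pure Gaussian sample.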

3. Reverse Diffusion — Denoising

The true reverse step q(x_{t-1} | x_t), going from noisy to clean, depends on the entire distribution of real images, so it cannot be computed directly. Instead, we approximate it with a learned distribution p_θ, parameterised by a neural network ε_θ.

Reverse process (learned):

p_θ(x_{t-1} | x_t) = 𝒩(x_{t-1} ; μ_θ(x_t, t) , Σ_θ(x_t, t))

μ_θ(x_t, t) = (1/√α_t) · [x_t − β_t/√(1−ᾱ_t) · ε_θ(x_t, t)]

The network ε_θ takes the noisy image x_t and timestep t as input and predicts the noise ε that was mixed into x_0 to produce x_t. Once trained, we iteratively apply the reverse formula starting from pure noise.
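
A rough sketch of the sampling loop built on the mean formula above. Two things here are assumptions rather than statements from the text: the variance is fixed to Σ = β_t·I (one common choice), and `toy_eps_model` is a hypothetical stand-in for a trained network. It is the exact noise predictor for a "dataset" containing only the all-zero image, which makes the result easy to check.

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)   # assumed linear schedule
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

def toy_eps_model(x_t, t):
    # Hypothetical stand-in for eps_theta: if every training image is all
    # zeros, then x_t = sqrt(1 - alpha_bar_t) * eps, so eps is recoverable.
    return x_t / np.sqrt(1.0 - alpha_bars[t - 1])

def p_sample_step(x_t, t, eps_model, rng):
    """One reverse step x_t -> x_{t-1}; t is 1-indexed."""
    beta_t, alpha_t, a_bar_t = betas[t - 1], alphas[t - 1], alpha_bars[t - 1]
    eps_hat = eps_model(x_t, t)
    mean = (x_t - beta_t / np.sqrt(1.0 - a_bar_t) * eps_hat) / np.sqrt(alpha_t)
    if t == 1:
        return mean                      # no fresh noise on the final step
    z = rng.standard_normal(x_t.shape)
    return mean + np.sqrt(beta_t) * z    # assumed fixed variance beta_t * I

rng = np.random.default_rng(0)
x = rng.standard_normal((3, 8, 8))       # x_T: pure Gaussian noise
for t in range(T, 0, -1):
    x = p_sample_step(x, t, toy_eps_model, rng)
```

With this toy predictor the loop ends at (numerically) the all-zero image, the only sample the toy "data distribution" contains. A real model would replace `toy_eps_model` with the trained denoiser.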

4. Training the Denoiser

Training is surprisingly simple. For each image in the dataset:

  1. Sample a random integer timestep t uniformly from {1, …, T}.
  2. Sample random noise ε ~ 𝒩(0, I).
  3. Compute the noisy image: x_t = √ᾱ_t · x_0 + √(1−ᾱ_t) · ε.
  4. Run the network: ε̂ = ε_θ(x_t, t).
  5. Minimise: ‖ε − ε̂‖² (predict the noise that was added).

Why predict noise instead of the clean image? Ho et al. found empirically that noise prediction gave better sample quality. One intuition: the target ε has unit variance at every timestep, whereas recovering x_0 from x_t ranges from nearly trivial at small t to nearly impossible at large t.
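
The five training steps can be sketched as a single loss function. This is an illustrative NumPy version only: `eps_model` is any callable standing in for the network, `zero_model` is a hypothetical untrained baseline, the schedule is assumed linear, and the squared error is averaged over elements rather than summed. A real implementation would compute gradients in a deep learning framework.

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)   # assumed linear schedule
alpha_bars = np.cumprod(1.0 - betas)

def training_loss(x0, eps_model, rng):
    t = int(rng.integers(1, T + 1))                          # 1. t uniform on {1..T}
    eps = rng.standard_normal(x0.shape)                      # 2. eps ~ N(0, I)
    a_bar = alpha_bars[t - 1]
    x_t = np.sqrt(a_bar) * x0 + np.sqrt(1.0 - a_bar) * eps   # 3. noisy image
    eps_hat = eps_model(x_t, t)                              # 4. run the network
    return float(np.mean((eps - eps_hat) ** 2))              # 5. squared error

def zero_model(x_t, t):
    # Hypothetical baseline that always predicts "no noise".
    return np.zeros_like(x_t)

rng = np.random.default_rng(0)
x0 = np.zeros((3, 8, 8))             # toy training image
loss = training_loss(x0, zero_model, rng)
```

A predictor that always outputs zero scores a loss near 1 (the variance of ε), which gives a useful baseline: a trained network should do clearly better than that.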

5. The U-Net Architecture

In DDPM and the early Stable Diffusion versions, the denoising network is a U-Net: a convolutional architecture with a contracting encoder path, a bottleneck, and a symmetric expanding decoder path. Skip connections between matching encoder and decoder levels preserve fine spatial detail.
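
To make the data flow concrete, here is a deliberately stripped-down sketch of the encoder, bottleneck, decoder, and skip connections. It is an assumption-laden toy: there are no learned convolutions at all, only average pooling, nearest-neighbour upsampling, and channel concatenation, and the function names are made up for illustration.

```python
import numpy as np

def downsample(x):
    """2x average pooling over H and W; x has shape (C, H, W)."""
    c, h, w = x.shape
    return x.reshape(c, h // 2, 2, w // 2, 2).mean(axis=(2, 4))

def upsample(x):
    """2x nearest-neighbour upsampling over H and W."""
    return x.repeat(2, axis=1).repeat(2, axis=2)

def toy_unet(x):
    e1 = x                    # encoder level 1: (C, H, W)
    e2 = downsample(e1)       # encoder level 2: (C, H/2, W/2)
    b = downsample(e2)        # bottleneck:      (C, H/4, W/4)
    # Decoder: upsample, then concatenate the matching encoder features.
    d2 = np.concatenate([upsample(b), e2], axis=0)    # skip from e2
    d1 = np.concatenate([upsample(d2), e1], axis=0)   # skip from e1
    return d1                 # a real U-Net applies convolutions at every level

x = np.arange(4 * 16 * 16, dtype=float).reshape(4, 16, 16)
out = toy_unet(x)             # shape (12, 16, 16)
```

The last four channels of `out` are the untouched input, which is the point of a skip connection: full-resolution detail reaches the decoder without passing through the low-resolution bottleneck.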

Key additions for diffusion:

Modern models (Stable Diffusion 3, DiT) replace the U-Net with pure Transformer architectures operating on patch sequences.

6. Latent Diffusion — Working in Compressed Space

Pixel-space diffusion is extremely expensive: a 512×512 RGB image has 262 144 pixels (786 432 values across three channels), and the denoiser must process all of them at each of up to 1000 steps. The solution (Rombach et al. 2022, the "Stable Diffusion" paper) is to work in latent space:

  1. Train an autoencoder (VAE) to compress 512×512 images to 64×64 latent tensors (8× compression per side).
  2. Run the entire diffusion process on the 64×64 latent, which has 64× fewer spatial positions than the 512×512 image.
  3. At inference, decode the final latent back to pixels with the VAE decoder.

This cuts the number of spatial positions the denoiser processes by 64×, which reduces compute by a large factor with little quality loss and makes even 1024×1024 generation feasible on consumer GPUs.

Stable Diffusion 1.5 numbers: 512×512 → 64×64×4 latent. 860M U-Net parameters. ~1 second for 20 denoising steps on an RTX 3090.
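
The size reduction is easy to verify with the Stable Diffusion 1.5 shapes quoted above (a 512×512 RGB image and a 64×64×4 latent). The variable names are just for illustration.

```python
pixel_values = 512 * 512 * 3                 # RGB image: 786432 numbers
latent_values = 64 * 64 * 4                  # latent tensor: 16384 numbers
spatial_ratio = (512 * 512) // (64 * 64)     # 64x fewer spatial positions
value_ratio = pixel_values / latent_values   # 48x fewer numbers overall
print(spatial_ratio, value_ratio)
```

The 64× figure counts spatial positions. Because the latent has four channels against the image's three, the total number of values shrinks by 48×.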

7. Text Conditioning with CLIP

To control what image is generated, the denoising U-Net also receives a text embedding via cross-attention. The text encoder is typically the text tower of a CLIP model (Contrastive Language–Image Pre-training).

CLIP is trained on (image, caption) pairs using a contrastive loss — matching images to their correct descriptions from a batch of negatives. The resulting text embeddings encode rich semantic content that the diffusion model can steer toward.
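
A minimal sketch of such a contrastive objective, written as a symmetric cross-entropy over cosine similarities. The function names are illustrative, the random embeddings are placeholders for encoder outputs, and the temperature of 0.07 is an assumed fixed value (CLIP itself learns it during training).

```python
import numpy as np

def log_softmax(z, axis):
    z = z - z.max(axis=axis, keepdims=True)   # subtract max for stability
    return z - np.log(np.exp(z).sum(axis=axis, keepdims=True))

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Row i of image_emb and row i of text_emb form a matching pair."""
    img = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    txt = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature        # (N, N) scaled cosine similarities
    idx = np.arange(len(logits))
    loss_i2t = -log_softmax(logits, axis=1)[idx, idx].mean()  # image -> caption
    loss_t2i = -log_softmax(logits, axis=0)[idx, idx].mean()  # caption -> image
    return float((loss_i2t + loss_t2i) / 2.0)

rng = np.random.default_rng(0)
image_emb = rng.standard_normal((8, 16))      # placeholder encoder outputs
aligned = clip_contrastive_loss(image_emb, image_emb.copy())
shuffled = clip_contrastive_loss(image_emb, np.roll(image_emb, 1, axis=0))
```

Here `aligned` is much smaller than `shuffled`: the loss is low only when each image is most similar to its own caption and less similar to the other captions in the batch.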

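In the U-Net's cross-attention layers, the queries Q are projected from the latent features, while the keys K and values V are both projected from the text embeddings. Each spatial location "attends" to the most relevant text tokens, injecting semantic guidance at every denoising step.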
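
A single-head version of this cross-attention can be sketched as follows. All sizes here are illustrative assumptions (an 8×8 latent grid with 32 features, 77 text tokens of width 768, attention width 40), and the random matrices `W_q`, `W_k`, `W_v` stand in for learned projections.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(latent, text, W_q, W_k, W_v):
    Q = latent @ W_q      # (positions, d): queries from latent features
    K = text @ W_k        # (tokens, d):    keys from text embeddings
    V = text @ W_v        # (tokens, d):    values from text embeddings
    weights = softmax(Q @ K.T / np.sqrt(Q.shape[-1]))  # (positions, tokens)
    return weights @ V, weights

rng = np.random.default_rng(0)
latent = rng.standard_normal((8 * 8, 32))   # flattened spatial grid (toy size)
text = rng.standard_normal((77, 768))       # one embedding per text token
W_q = rng.standard_normal((32, 40)) * 0.1   # hypothetical learned projections
W_k = rng.standard_normal((768, 40)) * 0.1
W_v = rng.standard_normal((768, 40)) * 0.1
out, weights = cross_attention(latent, text, W_q, W_k, W_v)
```

Each row of `weights` sums to 1 and describes how strongly one spatial location draws on each text token; `out` has one updated feature vector per location.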

8. Classifier-Free Guidance

Even with text conditioning, early models produced images that loosely matched the prompt. Classifier-free guidance (CFG) amplifies the text influence at inference time:

CFG formula:

ε̂_guided = ε_θ(x_t, ∅) + w · (ε_θ(x_t, c) − ε_θ(x_t, ∅))

c — text prompt embedding
∅ — empty/null prompt (unconditional)
w — guidance scale (typically 7–15)

The denoiser is run twice per step: once with the text prompt, once without. The difference is amplified by scale w and added back. Higher w means stronger prompt adherence but less diversity and potential artefacts.

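The "guidance scale" slider in Stable Diffusion UIs is exactly this parameter w. With the formula above, w = 0 gives the unconditional prediction and w = 1 the plain conditional prediction with no extra guidance. A value around 7.5 is a common balanced default; 20+ tends to produce cartoonish over-saturation.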
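
The CFG combination itself is a one-line extrapolation. In this sketch the two noise predictions are random placeholders for the two network passes, and `cfg_combine` is an illustrative name.

```python
import numpy as np

def cfg_combine(eps_uncond, eps_cond, w):
    """Extrapolate from the unconditional towards the conditional prediction."""
    return eps_uncond + w * (eps_cond - eps_uncond)

rng = np.random.default_rng(0)
eps_uncond = rng.standard_normal((4, 64, 64))   # placeholder: pass with null prompt
eps_cond = rng.standard_normal((4, 64, 64))     # placeholder: pass with text prompt
guided = cfg_combine(eps_uncond, eps_cond, w=7.5)
```

Setting `w=1` returns `eps_cond` unchanged and `w=0` returns `eps_uncond`; values above 1 push the prediction past the conditional one, which is where the stronger prompt adherence comes from.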

9. Sampling Schedulers

The original DDPM (Ho et al.) requires ~1000 denoising steps. Subsequent advances in sampling schedulers dramatically reduced this.

The field continues to improve: 4–8 steps is now achievable with techniques like Consistency Models and Adversarial Diffusion Distillation (used in Stable Diffusion Turbo).
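
As one concrete example of a faster sampler, here is a sketch of the deterministic DDIM update (Song et al., 2021). DDIM is named here purely as an illustration and is not one of the techniques listed above. The sampler visits only a subset of the T timesteps and adds no fresh noise. `oracle_eps` is a hypothetical stand-in for a trained network: it is the exact noise predictor for a dataset containing the single image `x0_true`.

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)   # assumed linear schedule
alpha_bars = np.cumprod(1.0 - betas)

def ddim_sample(x_T, eps_model, num_steps):
    # Evenly spaced, decreasing, 1-indexed timesteps, e.g. 1000, 889, ..., 1.
    timesteps = np.linspace(T, 1, num_steps).round().astype(int)
    x = x_T
    for i, t in enumerate(timesteps):
        a_bar_t = alpha_bars[t - 1]
        # alpha_bar of the next (less noisy) step; 1.0 means fully clean.
        a_bar_prev = alpha_bars[timesteps[i + 1] - 1] if i + 1 < num_steps else 1.0
        eps_hat = eps_model(x, t)
        x0_hat = (x - np.sqrt(1.0 - a_bar_t) * eps_hat) / np.sqrt(a_bar_t)
        x = np.sqrt(a_bar_prev) * x0_hat + np.sqrt(1.0 - a_bar_prev) * eps_hat
    return x

rng = np.random.default_rng(0)
x0_true = rng.uniform(-1.0, 1.0, size=(3, 8, 8))

def oracle_eps(x_t, t):
    # Hypothetical perfect predictor for a one-image dataset.
    a_bar = alpha_bars[t - 1]
    return (x_t - np.sqrt(a_bar) * x0_true) / np.sqrt(1.0 - a_bar)

sample = ddim_sample(rng.standard_normal((3, 8, 8)), oracle_eps, num_steps=10)
```

With a perfect noise predictor, ten steps are enough to land on `x0_true`. With a real, imperfect network, the number of steps trades speed against quality.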