🎨 Diffusion Models

How Stable Diffusion creates stunning images

The New State-of-the-Art

Diffusion Models have revolutionized image generation, powering tools like Stable Diffusion, DALL-E 2, and Midjourney. They create images by starting from pure noise and removing it step by step.

🧠 Core Concept

The Diffusion Process

Forward Process (Training): Gradually add noise to images until only pure noise remains

Reverse Process (Generation): Learn to remove noise step by step

# Simplified concept
Clean Image → +noise → +noise → ... → Pure Noise  (Forward - Training)
Pure Noise → -noise → -noise → ... → Clean Image  (Reverse - Generation)

# The model learns: "What noise was added at each step?"
# Then reverses it to generate new images!
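
The forward process does not have to be simulated one step at a time during training: under the standard DDPM formulation, an image can be noised directly to any step t in closed form. A minimal sketch (the schedule values below are the common DDPM defaults, not tied to any specific model):

import torch

# Linear beta schedule, as in the original DDPM paper
betas = torch.linspace(1e-4, 0.02, 1000)
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)  # "alpha-bar" per step

def add_noise(x0, t):
    # x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps
    eps = torch.randn_like(x0)
    x_t = alphas_cumprod[t].sqrt() * x0 + (1 - alphas_cumprod[t]).sqrt() * eps
    return x_t, eps  # the model is trained to predict eps from x_t and t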

🏗️ How Stable Diffusion Works

Step 1: Text Encoding

from transformers import CLIPTokenizer, CLIPTextModel

prompt = "A cat astronaut in space, digital art"

# Tokenize text
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
text_input = tokenizer(prompt, padding="max_length", max_length=77,
                       truncation=True, return_tensors="pt")

# Encode to embeddings
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")
text_embeddings = text_encoder(text_input.input_ids)[0]
# Shape: [1, 77, 768]
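
For classifier-free guidance (the mechanism behind guidance_scale, covered below), an unconditional embedding is produced the same way by encoding an empty prompt. A sketch reusing the tokenizer and text_encoder from above:

uncond_input = tokenizer("", padding="max_length", max_length=77,
                         return_tensors="pt")
uncond_embeddings = text_encoder(uncond_input.input_ids)[0]
# Same shape as text_embeddings: [1, 77, 768]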

Step 2: Start with Random Noise

import torch

# Random latent noise: a compressed 4x64x64 representation
latent = torch.randn(1, 4, 64, 64)
# The VAE later upsamples 8x, so this decodes to a 512x512 image

Step 3: Iterative Denoising (50 steps)

from diffusers import UNet2DConditionModel
from diffusers.schedulers import DDPMScheduler

unet = UNet2DConditionModel.from_pretrained("CompVis/stable-diffusion-v1-4", subfolder="unet")
scheduler = DDPMScheduler.from_pretrained("CompVis/stable-diffusion-v1-4", subfolder="scheduler")

# Use 50 denoising steps (without this, the scheduler defaults to all 1000)
scheduler.set_timesteps(50)

# Denoising loop
for t in scheduler.timesteps:
    # Predict noise
    with torch.no_grad():
        noise_pred = unet(
            latent, 
            t, 
            encoder_hidden_states=text_embeddings
        ).sample
    
    # Remove predicted noise
    latent = scheduler.step(noise_pred, t, latent).prev_sample
    
# After 50 steps: clear image in latent space!

Step 4: Decode to Pixels

from diffusers import AutoencoderKL

vae = AutoencoderKL.from_pretrained("CompVis/stable-diffusion-v1-4", subfolder="vae")

# Decode latent to image
with torch.no_grad():
    image = vae.decode(latent / 0.18215).sample

# Convert to PIL Image
from PIL import Image
image = (image / 2 + 0.5).clamp(0, 1)
image = image.cpu().permute(0, 2, 3, 1).numpy()[0]
image = Image.fromarray((image * 255).astype("uint8"))
image.save("cat_astronaut.png")

🚀 Using Stable Diffusion (Easy Way)

from diffusers import StableDiffusionPipeline
import torch

# Load model
model_id = "stabilityai/stable-diffusion-2-1"
pipe = StableDiffusionPipeline.from_pretrained(model_id, torch_dtype=torch.float16)
pipe = pipe.to("cuda")  # Use GPU

# Generate image
prompt = "A serene mountain landscape at sunset, photorealistic, 4k"
negative_prompt = "blurry, low quality, distorted"

image = pipe(
    prompt=prompt,
    negative_prompt=negative_prompt,
    num_inference_steps=50,
    guidance_scale=7.5,
    height=768,
    width=768
).images[0]

image.save("mountain_sunset.png")

🎛️ Key Parameters

guidance_scale

Range: 1-20

Default: 7.5

  • Low (1-5): Creative, less adherent to prompt
  • Medium (7-10): Balanced
  • High (15-20): Very literal, may oversaturate
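
Under the hood, guidance_scale blends two UNet predictions per step: one conditioned on the prompt and one on the empty-prompt embedding from Step 1. A sketch of that blend, reusing the names from Step 3:

noise_uncond = unet(latent, t, encoder_hidden_states=uncond_embeddings).sample
noise_text = unet(latent, t, encoder_hidden_states=text_embeddings).sample

# Push the prediction further in the "direction of the prompt"
noise_pred = noise_uncond + guidance_scale * (noise_text - noise_uncond)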

num_inference_steps

Range: 20-150

Default: 50

  • 20-30: Fast, lower quality
  • 50: Good quality, reasonable speed
  • 100+: Best quality, slow

negative_prompt

What to avoid in generation

  • "blurry, low quality"
  • "deformed, ugly"
  • "text, watermark"
  • "bad anatomy"

seed

Reproducibility control

  • Fixed seed: same image every time
  • Random seed: different results each run
  • Keep the seed fixed to iterate on a good result

generator = torch.Generator().manual_seed(42)
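
Passing the generator into the pipeline makes the run reproducible. A sketch assuming the pipe and prompt from above:

image = pipe(prompt, generator=generator).images[0]
# Same seed + same settings -> the same image every run;
# keep the seed fixed and tweak the prompt to refine a good result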

✍️ Prompt Engineering for Images

Anatomy of a Great Prompt

# Structure: [Subject] + [Setting] + [Style] + [Quality] + [Details]

prompt = ", ".join([
    "A majestic dragon",               # Subject
    "flying over a medieval castle",   # Setting
    "oil painting style",              # Style
    "highly detailed, 8k resolution",  # Quality
    "dramatic lighting, fantasy art",  # Details
    "by Greg Rutkowski",               # Artist reference
])

# More specific = better results!

Style Modifiers

Art Styles

  • oil painting
  • watercolor
  • digital art
  • pixel art
  • anime style

Quality Tags

  • highly detailed
  • 8k resolution
  • photorealistic
  • masterpiece
  • professional

Lighting

  • dramatic lighting
  • golden hour
  • studio lighting
  • cinematic
  • volumetric lighting

🔧 Advanced Features

Image-to-Image

from diffusers import StableDiffusionImg2ImgPipeline
from PIL import Image

pipe = StableDiffusionImg2ImgPipeline.from_pretrained("stabilityai/stable-diffusion-2-1")
pipe = pipe.to("cuda")

# Load input image
init_image = Image.open("sketch.png").convert("RGB")
init_image = init_image.resize((768, 768))

# Transform it: the prompt should describe the desired output
prompt = "a photorealistic rendering, highly detailed, sharp focus"
image = pipe(
    prompt=prompt,
    image=init_image,
    strength=0.75,  # How much to change (0-1)
    guidance_scale=7.5
).images[0]
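
strength controls how much of the input survives: low values stay close to the original, high values hand more control to the prompt. A quick sweep, reusing pipe, prompt, and init_image from above:

for s in (0.3, 0.5, 0.75, 0.9):
    out = pipe(prompt=prompt, image=init_image, strength=s).images[0]
    out.save(f"sketch_strength_{s}.png")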

Inpainting (Edit Parts)

from diffusers import StableDiffusionInpaintPipeline

pipe = StableDiffusionInpaintPipeline.from_pretrained("stabilityai/stable-diffusion-2-inpainting")
pipe = pipe.to("cuda")

# Load image and mask
image = Image.open("photo.png")
mask = Image.open("mask.png")  # White = area to replace

prompt = "A red sports car"
result = pipe(
    prompt=prompt,
    image=image,
    mask_image=mask,
    num_inference_steps=50
).images[0]
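
If no mask file exists, one can be drawn programmatically with PIL. A sketch with hypothetical coordinates for the region to replace:

from PIL import Image, ImageDraw

mask = Image.new("L", image.size, 0)             # black = keep
draw = ImageDraw.Draw(mask)
draw.rectangle((200, 300, 460, 520), fill=255)   # white = replace
mask.save("mask.png")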

Upscaling

from diffusers import StableDiffusionUpscalePipeline

pipe = StableDiffusionUpscalePipeline.from_pretrained("stabilityai/stable-diffusion-x4-upscaler")
pipe = pipe.to("cuda")

low_res_image = Image.open("low_res.png")

upscaled = pipe(
    prompt="high quality, detailed",
    image=low_res_image
).images[0]

# 4x resolution increase!

⚡ Optimization Tips

Speed Up Generation

  • Use float16 instead of float32
  • Reduce inference steps (50→30)
  • Use smaller models (SD 1.5 vs 2.1)
  • Enable xformers memory-efficient attention
  • Use DPM-Solver++ scheduler (faster)

Improve Quality

  • Increase inference steps (50→100)
  • Fine-tune guidance_scale
  • Use detailed prompts
  • Add quality keywords
  • Use negative prompts effectively

# Memory-efficient setup
pipe = StableDiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-1",
    torch_dtype=torch.float16,  # Half precision
    use_safetensors=True
)
pipe = pipe.to("cuda")
pipe.enable_attention_slicing()  # Reduce VRAM
pipe.enable_vae_slicing()        # Further VRAM reduction
pipe.enable_xformers_memory_efficient_attention()  # Needs the xformers package

# Faster scheduler
from diffusers import DPMSolverMultistepScheduler
pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config)

# Now generate in roughly 20-30 steps with comparable quality!

🎯 Key Takeaways