The New State-of-the-Art
Diffusion Models have revolutionized image generation, powering tools like Stable Diffusion, DALL-E 2, and Midjourney. They create images by gradually removing noise in an iterative process.
🧠 Core Concept
The Diffusion Process
Forward Process (Training): Gradually add noise to images until only pure noise remains
Reverse Process (Generation): Learn to remove the noise step by step
# Simplified concept
Clean Image → +noise → +noise → ... → Pure Noise (Forward - Training)
Pure Noise → -noise → -noise → ... → Clean Image (Reverse - Generation)
# The model learns: "What noise was added at each step?"
# Then reverses it to generate new images!
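To make the forward process concrete, here is a minimal training-side sketch in PyTorch. The linear noise schedule and the add_noise helper are illustrative (not tied to any particular released model); the point is that the network is trained to predict the noise that was added.

import torch

# Illustrative linear noise schedule over 1000 timesteps
betas = torch.linspace(1e-4, 0.02, 1000)
alpha_bar = torch.cumprod(1.0 - betas, dim=0)  # how much of the original signal is left at each step

def add_noise(x0, t):
    """Forward process: jump directly to timestep t in closed form."""
    noise = torch.randn_like(x0)
    x_t = alpha_bar[t].sqrt() * x0 + (1.0 - alpha_bar[t]).sqrt() * noise
    return x_t, noise

x0 = torch.randn(1, 3, 64, 64)   # stand-in for a batch of clean images
t = torch.tensor(500)            # a timestep in the middle of the schedule
x_t, noise = add_noise(x0, t)

# Training objective (with some noise-prediction network `model`, hypothetical here):
# loss = torch.nn.functional.mse_loss(model(x_t, t), noise)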
🏗️ How Stable Diffusion Works
Step 1: Text Encoding
from transformers import CLIPTokenizer, CLIPTextModel
prompt = "A cat astronaut in space, digital art"
# Tokenize text
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
text_input = tokenizer(prompt, padding="max_length",
                       max_length=77, return_tensors="pt")
# Encode to embeddings
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")
text_embeddings = text_encoder(text_input.input_ids)[0]
# Shape: [1, 77, 768]
Step 2: Start with Random Noise
import torch
# Random latent noise (compressed representation)
latent = torch.randn(1, 4, 64, 64)
# Becomes a 512x512 image after VAE decoding (the VAE downsamples each side by 8x)
Step 3: Iterative Denoising (50 steps)
from diffusers import UNet2DConditionModel
from diffusers.schedulers import DDPMScheduler
unet = UNet2DConditionModel.from_pretrained("CompVis/stable-diffusion-v1-4", subfolder="unet")
scheduler = DDPMScheduler.from_pretrained("CompVis/stable-diffusion-v1-4", subfolder="scheduler")
# Use 50 denoising steps, matching the heading above
scheduler.set_timesteps(50)

# Denoising loop (classifier-free guidance is omitted here for clarity; see the sketch below)
for t in scheduler.timesteps:
    # Predict the noise present in the latent at this timestep
    with torch.no_grad():
        noise_pred = unet(
            latent,
            t,
            encoder_hidden_states=text_embeddings
        ).sample

    # Remove the predicted noise
    latent = scheduler.step(noise_pred, t, latent).prev_sample

# After 50 steps: a clean image in latent space!
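The loop above skips classifier-free guidance, which is what the guidance_scale parameter (covered later) actually controls. Below is a sketch of the guided version; it reuses tokenizer, text_encoder, unet, scheduler, latent, and text_embeddings from Steps 1-3 and replaces the simple loop above. (The real pipeline batches the two UNet calls together; running them separately is equivalent but slower.)

# Embeddings for an empty ("unconditional") prompt
uncond_input = tokenizer("", padding="max_length", max_length=77, return_tensors="pt")
uncond_embeddings = text_encoder(uncond_input.input_ids)[0]

guidance_scale = 7.5
scheduler.set_timesteps(50)

for t in scheduler.timesteps:
    with torch.no_grad():
        # Predict noise twice: with the prompt and with the empty prompt
        noise_text = unet(latent, t, encoder_hidden_states=text_embeddings).sample
        noise_uncond = unet(latent, t, encoder_hidden_states=uncond_embeddings).sample

    # Push the prediction toward the prompt-conditioned direction
    noise_pred = noise_uncond + guidance_scale * (noise_text - noise_uncond)
    latent = scheduler.step(noise_pred, t, latent).prev_sample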
Step 4: Decode to Pixels
from diffusers import AutoencoderKL
vae = AutoencoderKL.from_pretrained("CompVis/stable-diffusion-v1-4", subfolder="vae")
# Decode latent to image
with torch.no_grad():
    image = vae.decode(latent / 0.18215).sample  # 0.18215 is the SD v1 VAE latent scaling factor
# Convert to PIL Image
from PIL import Image
image = (image / 2 + 0.5).clamp(0, 1)
image = image.cpu().permute(0, 2, 3, 1).numpy()[0]
image = Image.fromarray((image * 255).astype("uint8"))
image.save("cat_astronaut.png")
🚀 Using Stable Diffusion (Easy Way)
from diffusers import StableDiffusionPipeline
import torch
# Load model
model_id = "stabilityai/stable-diffusion-2-1"
pipe = StableDiffusionPipeline.from_pretrained(model_id, torch_dtype=torch.float16)
pipe = pipe.to("cuda") # Use GPU
# Generate image
prompt = "A serene mountain landscape at sunset, photorealistic, 4k"
negative_prompt = "blurry, low quality, distorted"
image = pipe(
    prompt=prompt,
    negative_prompt=negative_prompt,
    num_inference_steps=50,
    guidance_scale=7.5,
    height=768,
    width=768
).images[0]
image.save("mountain_sunset.png")
🎛️ Key Parameters
guidance_scale
Range: 1-20
Default: 7.5
- Low (1-5): Creative, less adherent to prompt
- Medium (7-10): Balanced
- High (15-20): Very literal, may oversaturate
num_inference_steps
Range: 20-150
Default: 50
- 20-30: Fast, lower quality
- 50: Good quality, reasonable speed
- 100+: Best quality, slow
negative_prompt
What to avoid in generation
- "blurry, low quality"
- "deformed, ugly"
- "text, watermark"
- "bad anatomy"
seed
Reproducibility control
- Fixed seed: Same image every time
- Random seed: Different results
- Use a fixed seed to iterate on a result you like
generator = torch.Generator().manual_seed(42)  # pass to the pipeline, as shown below
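Putting the key parameters together in one reproducible call, assuming the pipe loaded in the section above:

generator = torch.Generator().manual_seed(42)   # fixed seed -> same image every run

image = pipe(
    prompt="A serene mountain landscape at sunset, photorealistic, 4k",
    negative_prompt="blurry, low quality, distorted",
    num_inference_steps=50,   # quality vs. speed
    guidance_scale=7.5,       # prompt adherence
    generator=generator       # reproducibility
).images[0]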
✍️ Prompt Engineering for Images
Anatomy of a Great Prompt
# Structure: [Subject] + [Setting] + [Style] + [Quality] + [Details] + [Artist]
prompt = (
    "A majestic dragon, "               # Subject
    "flying over a medieval castle, "   # Setting
    "oil painting style, "              # Style
    "highly detailed, 8k resolution, "  # Quality
    "dramatic lighting, fantasy art, "  # Details
    "by Greg Rutkowski"                 # Artist reference
)
# More specific = better results!
Style Modifiers
Art Styles
- oil painting
- watercolor
- digital art
- pixel art
- anime style
Quality Tags
- highly detailed
- 8k resolution
- photorealistic
- masterpiece
- professional
Lighting
- dramatic lighting
- golden hour
- studio lighting
- cinematic
- volumetric lighting
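One way to apply these modifiers consistently is to assemble prompts from building blocks. The build_prompt helper below is purely illustrative (it is not part of diffusers):

def build_prompt(subject, style, quality, lighting):
    """Join prompt building blocks into one comma-separated prompt."""
    return ", ".join([subject, style, quality, lighting])

prompt = build_prompt(
    subject="a lighthouse on a rocky coast at night",
    style="digital art",
    quality="highly detailed, 8k resolution",
    lighting="volumetric lighting",
)
# -> "a lighthouse on a rocky coast at night, digital art, highly detailed, 8k resolution, volumetric lighting"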
🔧 Advanced Features
Image-to-Image
from diffusers import StableDiffusionImg2ImgPipeline
from PIL import Image
pipe = StableDiffusionImg2ImgPipeline.from_pretrained("stabilityai/stable-diffusion-2-1")
pipe = pipe.to("cuda")
# Load input image
init_image = Image.open("sketch.png").convert("RGB")
init_image = init_image.resize((768, 768))
# Transform it
prompt = "A photorealistic version of this sketch"
image = pipe(
    prompt=prompt,
    image=init_image,
    strength=0.75,  # How much to change (0-1)
    guidance_scale=7.5
).images[0]
Inpainting (Edit Parts)
from diffusers import StableDiffusionInpaintPipeline
pipe = StableDiffusionInpaintPipeline.from_pretrained("stabilityai/stable-diffusion-2-inpainting")
pipe = pipe.to("cuda")
# Load image and mask
image = Image.open("photo.png")
mask = Image.open("mask.png") # White = area to replace
prompt = "A red sports car"
result = pipe(
    prompt=prompt,
    image=image,
    mask_image=mask,
    num_inference_steps=50
).images[0]
Upscaling
from diffusers import StableDiffusionUpscalePipeline
pipe = StableDiffusionUpscalePipeline.from_pretrained("stabilityai/stable-diffusion-x4-upscaler")
pipe = pipe.to("cuda")
low_res_image = Image.open("low_res.png")
upscaled = pipe(
    prompt="high quality, detailed",
    image=low_res_image
).images[0]
# 4x resolution increase!
⚡ Optimization Tips
Speed Up Generation
- Use float16 instead of float32
- Reduce inference steps (50→30)
- Use a smaller checkpoint (e.g., SD 1.5 instead of 2.1)
- Enable xformers memory-efficient attention (see the snippet after the setup code below)
- Use the DPM-Solver++ scheduler (fewer steps for similar quality)
Improve Quality
- Increase inference steps (50→100)
- Fine-tune guidance_scale
- Use detailed prompts
- Add quality keywords
- Use negative prompts effectively
# Memory-efficient setup
pipe = StableDiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-1",
    torch_dtype=torch.float16,  # Half precision
    use_safetensors=True
)
pipe = pipe.to("cuda")
pipe.enable_attention_slicing()  # Reduce VRAM usage
pipe.enable_vae_slicing()        # Further VRAM reduction
# Faster scheduler
from diffusers import DPMSolverMultistepScheduler
pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config)
# DPM-Solver++ typically reaches comparable quality in ~20-25 steps instead of 50
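If the xformers package is installed, the memory-efficient attention mentioned in the list above is a one-line switch:

pipe.enable_xformers_memory_efficient_attention()  # requires the xformers package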
🎯 Key Takeaways
- Diffusion models create images by iteratively removing noise
- Stable Diffusion is open-source and runs locally
- guidance_scale controls prompt adherence (7.5 is a good default)
- Detailed prompts with style/quality keywords produce better results
- Negative prompts are essential for avoiding unwanted elements
- Image-to-image and inpainting enable editing existing images