Case Study 2: Controlled Image Generation with Stable Diffusion

Overview

In this case study, you will explore the full Stable Diffusion pipeline for controlled image generation. Using pre-trained models from Hugging Face's diffusers library, you will implement text-to-image generation with classifier-free guidance, image-to-image translation, and ControlNet-guided generation. You will also investigate the effect of different parameters on output quality and learn practical techniques for achieving specific visual results.

Problem Statement

Build a comprehensive image generation toolkit that demonstrates the controllability of Stable Diffusion across three tasks:

  1. Text-to-image: Generate high-quality images from text prompts with various guidance scales
  2. Image-to-image: Transform existing images based on text prompts while preserving structure
  3. ControlNet: Generate images guided by spatial control signals (edge maps)

Approach

Step 1: Text-to-Image with Guidance Scale Analysis

We use Stable Diffusion v1.5 with the PNDM scheduler (a fast multi-step scheduler) and systematically analyze the effect of classifier-free guidance:

  • Fix a prompt ("a photograph of a castle on a cliff overlooking the ocean at sunset") and a random seed
  • Generate images at guidance scales: 1.0, 3.0, 5.0, 7.5, 10.0, 15.0, 20.0
  • Measure CLIP score (text-image alignment) and visual quality for each

Expected findings: Quality and text alignment improve as the guidance scale $w$ increases from 1 to 7.5; artifacts (oversaturated colors, distortion) appear for $w \geq 15$.
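
A minimal sketch of this sweep is shown below, assuming Hugging Face's diffusers library and a CUDA device; the checkpoint ID, output file names, and step count are illustrative rather than taken from the case-study code.

```python
# Guidance-scale sweep with a fixed prompt and seed (Step 1 sketch).
import torch
from diffusers import StableDiffusionPipeline, PNDMScheduler

model_id = "runwayml/stable-diffusion-v1-5"  # illustrative SD v1.5 checkpoint ID
pipe = StableDiffusionPipeline.from_pretrained(model_id, torch_dtype=torch.float16)
pipe.scheduler = PNDMScheduler.from_config(pipe.scheduler.config)  # PNDM (the v1.5 default)
pipe = pipe.to("cuda")

prompt = "a photograph of a castle on a cliff overlooking the ocean at sunset"
seed = 42  # fixed seed so only the guidance scale varies between runs

for w in [1.0, 3.0, 5.0, 7.5, 10.0, 15.0, 20.0]:
    generator = torch.Generator(device="cuda").manual_seed(seed)
    image = pipe(
        prompt,
        guidance_scale=w,
        num_inference_steps=50,
        generator=generator,
    ).images[0]
    image.save(f"castle_w{w}.png")
```

CLIP scores for the saved images can then be computed with a pretrained CLIP model (for example via the transformers library); that scoring step is omitted here for brevity.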

Step 2: Prompt Engineering Techniques

We explore how prompt structure affects generation:

  • Positive prompts: Appending quality tags such as "masterpiece, best quality, highly detailed, 8k" tends to improve output quality
  • Negative prompts: "blurry, low quality, distorted, watermark" steers the model away from common artifacts
  • Prompt weighting: Using parentheses to emphasize specific elements: "(red:1.5) car"
  • Token position: Important concepts placed earlier in the prompt receive more attention

We generate a grid of images comparing different prompt strategies on the same seed and base concept.
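
A sketch of such a comparison appears below, reusing the pipe object from Step 1; the base concept, quality tags, and file names are illustrative. Prompt-weighting syntax such as "(red:1.5)" is not parsed by the plain diffusers pipeline call shown here and would require additional tooling, so it is omitted.

```python
# Compare prompt strategies on the same seed and base concept (Step 2 sketch).
import torch

base = "a portrait of an astronaut"  # illustrative base concept
variants = {
    "plain": dict(prompt=base),
    "quality_tags": dict(prompt=base + ", masterpiece, best quality, highly detailed, 8k"),
    "with_negative": dict(
        prompt=base + ", masterpiece, best quality, highly detailed, 8k",
        negative_prompt="blurry, low quality, distorted, watermark",
    ),
}

for name, kwargs in variants.items():
    generator = torch.Generator(device="cuda").manual_seed(0)  # same seed for every variant
    image = pipe(guidance_scale=7.5, generator=generator, **kwargs).images[0]
    image.save(f"prompt_{name}.png")
```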

Step 3: Image-to-Image Translation

The img2img pipeline adds noise to a source image and then denoises it under a new text prompt (sketched after the list below):

  • Source: A photograph of a building
  • Target prompts: "a watercolor painting of a building," "a building in winter with snow"
  • Strength parameter sweep: 0.2, 0.4, 0.6, 0.8, 1.0
  • Low strength preserves the source structure; high strength allows more creative freedom
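
A sketch of the strength sweep, assuming a local source photograph at the placeholder path building.jpg and the same illustrative v1.5 checkpoint as in Step 1:

```python
# Image-to-image strength sweep (Step 3 sketch).
import torch
from diffusers import StableDiffusionImg2ImgPipeline
from diffusers.utils import load_image

pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

source = load_image("building.jpg").resize((512, 512))  # placeholder source photograph
prompt = "a watercolor painting of a building"

for strength in [0.2, 0.4, 0.6, 0.8, 1.0]:
    generator = torch.Generator(device="cuda").manual_seed(0)
    image = pipe(
        prompt=prompt,
        image=source,
        strength=strength,       # fraction of the diffusion process applied to the source
        guidance_scale=7.5,
        generator=generator,
    ).images[0]
    image.save(f"img2img_s{strength}.png")
```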

Step 4: ControlNet with Canny Edges

ControlNet enables precise spatial control:

  1. Load a reference image and extract Canny edges using OpenCV
  2. Use ControlNet with Canny edge conditioning to generate new images that follow the edge structure
  3. Vary the ControlNet conditioning scale (0.5, 0.75, 1.0, 1.5) to control how strictly edges are followed
  4. Combine ControlNet with different text prompts to change the style while preserving structure
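
A sketch of the Canny-conditioned pipeline, assuming the lllyasviel/sd-controlnet-canny checkpoint, a placeholder reference image, and an illustrative prompt:

```python
# ControlNet generation from Canny edges (Step 4 sketch).
import cv2
import numpy as np
import torch
from PIL import Image
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline
from diffusers.utils import load_image

reference = load_image("reference.jpg").resize((512, 512))  # placeholder reference image
edges = cv2.Canny(np.array(reference), 100, 200)            # single-channel edge map
control_image = Image.fromarray(np.stack([edges] * 3, axis=-1))  # replicate to 3 channels

controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-canny", torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet, torch_dtype=torch.float16
).to("cuda")

for scale in [0.5, 0.75, 1.0, 1.5]:
    generator = torch.Generator(device="cuda").manual_seed(0)
    image = pipe(
        "a futuristic glass building at night, neon lights",  # illustrative style prompt
        image=control_image,
        controlnet_conditioning_scale=scale,  # how strictly the edges are followed
        generator=generator,
    ).images[0]
    image.save(f"controlnet_scale{scale}.png")
```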

Step 5: Combining Multiple Control Techniques

We demonstrate combining ControlNet with image-to-image:

  1. Start with a photograph
  2. Extract Canny edges
  3. Use ControlNet + img2img to transform the image with both structural and textual guidance
  4. Compare with ControlNet alone and img2img alone
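
A sketch of the combined setup using diffusers' ControlNet img2img pipeline, reusing the controlnet model and control_image from the Step 4 sketch; the source path, prompt, and strength value are illustrative.

```python
# ControlNet + img2img: structural and textual guidance together (Step 5 sketch).
import torch
from diffusers import StableDiffusionControlNetImg2ImgPipeline
from diffusers.utils import load_image

pipe = StableDiffusionControlNetImg2ImgPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet, torch_dtype=torch.float16
).to("cuda")

source = load_image("reference.jpg").resize((512, 512))  # same placeholder photograph
generator = torch.Generator(device="cuda").manual_seed(0)
image = pipe(
    prompt="a watercolor painting of the same scene",
    image=source,                 # img2img source (initial noisy latents)
    control_image=control_image,  # Canny edges (spatial guidance)
    strength=0.7,
    controlnet_conditioning_scale=1.0,
    generator=generator,
).images[0]
image.save("controlnet_img2img.png")
```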

Results

Guidance Scale Analysis

Guidance Scale   CLIP Score   Visual Quality   Notes
1.0              0.21         Low              Blurry, unfocused
3.0              0.26         Medium           Recognizable but generic
5.0              0.29         Good             Well-formed, moderate detail
7.5              0.31         Best             Sharp, detailed, coherent
10.0             0.32         Good             Slightly oversaturated
15.0             0.30         Degraded         Oversaturated colors
20.0             0.27         Poor             Severe artifacts, distortion

Image-to-Image Strength Analysis

  • Strength 0.2: Almost identical to the source image, minimal change
  • Strength 0.4: Subtle style transfer while preserving most structure
  • Strength 0.6: Clear transformation with recognizable source structure
  • Strength 0.8: Significant creative reinterpretation, loose structural similarity
  • Strength 1.0: Equivalent to text-to-image (source image fully overwritten)

ControlNet Results

ControlNet successfully constrains generation to follow the edge maps while leaving stylistic choices to the text prompt. A conditioning scale of 1.0 provides the best balance between edge adherence and natural appearance; a scale of 1.5 follows the edges more strictly, but the results can look rigid.

Key Lessons

  1. Guidance scale 7.5 is a good default for most prompts, but artistic subjects may benefit from lower values (5-6) for more creative variation.

  2. Negative prompts are essential for production-quality results. They steer the model away from common failure modes more effectively than positive prompts alone.

  3. Image-to-image strength is non-linear: The visual change between 0.4 and 0.6 is much larger than between 0.2 and 0.4, because the model needs a critical amount of noise to deviate from the source.

  4. ControlNet excels at structure preservation: For tasks like architectural rendering or pose-guided generation, ControlNet provides a level of spatial control that text prompts alone cannot achieve.

  5. Scheduler choice matters for speed: PNDM and DPM-Solver++ produce comparable quality at 20-30 steps, while the Euler ancestral scheduler may need 30-50 steps for equivalent quality.

  6. Seed reproducibility: Fixed seeds enable systematic comparison of parameters, but different schedulers produce different images even with the same seed.
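
A short sketch illustrating lessons 5 and 6, reusing the pipe from Step 1: the scheduler is swapped in place and the seed is fixed so that only the scheduler (and its step count) changes between runs. The step counts below are illustrative.

```python
# Scheduler swap with a fixed seed (lessons 5 and 6 sketch).
import torch
from diffusers import DPMSolverMultistepScheduler, EulerAncestralDiscreteScheduler

prompt = "a photograph of a castle on a cliff overlooking the ocean at sunset"
schedulers = {
    "dpm_solver_pp": (DPMSolverMultistepScheduler, 25),
    "euler_ancestral": (EulerAncestralDiscreteScheduler, 40),
}

for name, (scheduler_cls, steps) in schedulers.items():
    pipe.scheduler = scheduler_cls.from_config(pipe.scheduler.config)
    # Same seed for both runs; the images will still differ across schedulers.
    generator = torch.Generator(device="cuda").manual_seed(123)
    image = pipe(prompt, num_inference_steps=steps, generator=generator).images[0]
    image.save(f"scheduler_{name}.png")
```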

Ethical Considerations

  • Always respect copyright when using reference images for img2img or ControlNet
  • Disclose AI-generated content when sharing or publishing
  • Be mindful of generating images of real people without consent
  • Commercial use of Stable Diffusion outputs should comply with the model's license terms

Code Reference

The complete implementation is available in code/case-study-code.py.