Case Study 2: Controlled Image Generation with Stable Diffusion

Overview

In this case study, you will explore the full Stable Diffusion pipeline for controlled image generation. Using pre-trained models from Hugging Face's diffusers library, you will implement text-to-image generation with classifier-free guidance, image-to-image translation, and ControlNet-guided generation. You will also investigate the effect of different parameters on output quality and learn practical techniques for achieving specific visual results.

Problem Statement

Build a comprehensive image generation toolkit that demonstrates the controllability of Stable Diffusion across three tasks:

  1. Text-to-image: Generate high-quality images from text prompts with various guidance scales
  2. Image-to-image: Transform existing images based on text prompts while preserving structure
  3. ControlNet: Generate images guided by spatial control signals (edge maps)

Approach

Step 1: Text-to-Image with Guidance Scale Analysis

We use Stable Diffusion v1.5 with the PNDM scheduler (a fast multi-step scheduler) and systematically analyze the effect of classifier-free guidance:

  • Fix a prompt ("a photograph of a castle on a cliff overlooking the ocean at sunset") and a random seed
  • Generate images at guidance scales: 1.0, 3.0, 5.0, 7.5, 10.0, 15.0, 20.0
  • Measure CLIP score (text-image alignment) and visual quality for each

Expected findings: Quality and text alignment improve as the guidance scale $w$ increases from 1 to 7.5; artifacts (oversaturated colors, distortion) appear for $w \geq 15$.
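
A minimal sketch of this sweep is shown below, assuming Hugging Face's diffusers library and a CUDA device; the checkpoint ID, output file names, and step count are illustrative rather than taken from the case-study code.

```python
# Guidance-scale sweep with a fixed prompt and seed (Step 1 sketch).
import torch
from diffusers import StableDiffusionPipeline, PNDMScheduler

model_id = "runwayml/stable-diffusion-v1-5"  # illustrative SD v1.5 checkpoint ID
pipe = StableDiffusionPipeline.from_pretrained(model_id, torch_dtype=torch.float16)
pipe.scheduler = PNDMScheduler.from_config(pipe.scheduler.config)  # PNDM (the v1.5 default)
pipe = pipe.to("cuda")

prompt = "a photograph of a castle on a cliff overlooking the ocean at sunset"
seed = 42  # fixed seed so only the guidance scale varies between runs

for w in [1.0, 3.0, 5.0, 7.5, 10.0, 15.0, 20.0]:
    generator = torch.Generator(device="cuda").manual_seed(seed)
    image = pipe(
        prompt,
        guidance_scale=w,
        num_inference_steps=50,
        generator=generator,
    ).images[0]
    image.save(f"castle_w{w}.png")
```

CLIP scores for the saved images can then be computed with a pretrained CLIP model (for example via the transformers library); that scoring step is omitted here for brevity.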

Step 2: Prompt Engineering Techniques

We explore how prompt structure affects generation:

  • Positive prompts: Appending quality tags such as "masterpiece, best quality, highly detailed, 8k" tends to improve output quality
  • Negative prompts: "blurry, low quality, distorted, watermark" steers the model away from common artifacts
  • Prompt weighting: Using parentheses to emphasize specific elements: "(red:1.5) car"
  • Token position: Important concepts placed earlier in the prompt receive more attention

We generate a grid of images comparing different prompt strategies on the same seed and base concept.
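
A sketch of such a comparison appears below, reusing the pipe object from Step 1; the base concept, quality tags, and file names are illustrative. Prompt-weighting syntax such as "(red:1.5)" is not parsed by the plain diffusers pipeline call shown here and would require additional tooling, so it is omitted.

```python
# Compare prompt strategies on the same seed and base concept (Step 2 sketch).
import torch

base = "a portrait of an astronaut"  # illustrative base concept
variants = {
    "plain": dict(prompt=base),
    "quality_tags": dict(prompt=base + ", masterpiece, best quality, highly detailed, 8k"),
    "with_negative": dict(
        prompt=base + ", masterpiece, best quality, highly detailed, 8k",
        negative_prompt="blurry, low quality, distorted, watermark",
    ),
}

for name, kwargs in variants.items():
    generator = torch.Generator(device="cuda").manual_seed(0)  # same seed for every variant
    image = pipe(guidance_scale=7.5, generator=generator, **kwargs).images[0]
    image.save(f"prompt_{name}.png")
```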

Step 3: Image-to-Image Translation

The img2img pipeline adds noise to a source image and then denoises it under a new text prompt (sketched after the list below):

  • Source: A photograph of a building
  • Target prompts: "a watercolor painting of a building," "a building in winter with snow"
  • Strength parameter sweep: 0.2, 0.4, 0.6, 0.8, 1.0
  • Low strength preserves the source structure; high strength allows more creative freedom
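
A sketch of the strength sweep, assuming a local source photograph at the placeholder path building.jpg and the same illustrative v1.5 checkpoint as in Step 1:

```python
# Image-to-image strength sweep (Step 3 sketch).
import torch
from diffusers import StableDiffusionImg2ImgPipeline
from diffusers.utils import load_image

pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

source = load_image("building.jpg").resize((512, 512))  # placeholder source photograph
prompt = "a watercolor painting of a building"

for strength in [0.2, 0.4, 0.6, 0.8, 1.0]:
    generator = torch.Generator(device="cuda").manual_seed(0)
    image = pipe(
        prompt=prompt,
        image=source,
        strength=strength,       # fraction of the diffusion process applied to the source
        guidance_scale=7.5,
        generator=generator,
    ).images[0]
    image.save(f"img2img_s{strength}.png")
```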

Step 4: ControlNet with Canny Edges

ControlNet enables precise spatial control:

  1. Load a reference image and extract Canny edges using OpenCV
  2. Use ControlNet with Canny edge conditioning to generate new images that follow the edge structure
  3. Vary the ControlNet conditioning scale (0.5, 0.75, 1.0, 1.5) to control how strictly edges are followed
  4. Combine ControlNet with different text prompts to change the style while preserving structure
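
A sketch of the Canny-conditioned pipeline, assuming the lllyasviel/sd-controlnet-canny checkpoint, a placeholder reference image, and an illustrative prompt:

```python
# ControlNet generation from Canny edges (Step 4 sketch).
import cv2
import numpy as np
import torch
from PIL import Image
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline
from diffusers.utils import load_image

reference = load_image("reference.jpg").resize((512, 512))  # placeholder reference image
edges = cv2.Canny(np.array(reference), 100, 200)            # single-channel edge map
control_image = Image.fromarray(np.stack([edges] * 3, axis=-1))  # replicate to 3 channels

controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-canny", torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet, torch_dtype=torch.float16
).to("cuda")

for scale in [0.5, 0.75, 1.0, 1.5]:
    generator = torch.Generator(device="cuda").manual_seed(0)
    image = pipe(
        "a futuristic glass building at night, neon lights",  # illustrative style prompt
        image=control_image,
        controlnet_conditioning_scale=scale,  # how strictly the edges are followed
        generator=generator,
    ).images[0]
    image.save(f"controlnet_scale{scale}.png")
```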

Step 5: Combining Multiple Control Techniques

We demonstrate combining ControlNet with image-to-image:

  1. Start with a photograph
  2. Extract Canny edges
  3. Use ControlNet + img2img to transform the image with both structural and textual guidance
  4. Compare with ControlNet alone and img2img alone
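
A sketch of the combined setup using diffusers' ControlNet img2img pipeline, reusing the controlnet model and control_image from the Step 4 sketch; the source path, prompt, and strength value are illustrative.

```python
# ControlNet + img2img: structural and textual guidance together (Step 5 sketch).
import torch
from diffusers import StableDiffusionControlNetImg2ImgPipeline
from diffusers.utils import load_image

pipe = StableDiffusionControlNetImg2ImgPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet, torch_dtype=torch.float16
).to("cuda")

source = load_image("reference.jpg").resize((512, 512))  # same placeholder photograph
generator = torch.Generator(device="cuda").manual_seed(0)
image = pipe(
    prompt="a watercolor painting of the same scene",
    image=source,                 # img2img source (initial noisy latents)
    control_image=control_image,  # Canny edges (spatial guidance)
    strength=0.7,
    controlnet_conditioning_scale=1.0,
    generator=generator,
).images[0]
image.save("controlnet_img2img.png")
```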

Results

Guidance Scale Analysis

Guidance Scale   CLIP Score   Visual Quality   Notes
1.0              0.21         Low              Blurry, unfocused
3.0              0.26         Medium           Recognizable but generic
5.0              0.29         Good             Well-formed, moderate detail
7.5              0.31         Best             Sharp, detailed, coherent
10.0             0.32         Good             Slightly oversaturated
15.0             0.30         Degraded         Oversaturated colors
20.0             0.27         Poor             Severe artifacts, distortion

Image-to-Image Strength Analysis

  • Strength 0.2: Almost identical to the source image, minimal change
  • Strength 0.4: Subtle style transfer while preserving most structure
  • Strength 0.6: Clear transformation with recognizable source structure
  • Strength 0.8: Significant creative reinterpretation, loose structural similarity
  • Strength 1.0: Equivalent to text-to-image (source image fully overwritten)

ControlNet Results

ControlNet successfully constrains generation to follow the edge maps while leaving stylistic choices to the text prompt. A conditioning scale of 1.0 provides the best balance between edge adherence and natural appearance; a scale of 1.5 follows the edges more strictly, but the results can look rigid.

Key Lessons

  1. Guidance scale 7.5 is a good default for most prompts, but artistic subjects may benefit from lower values (5-6) for more creative variation.

  2. Negative prompts are essential for production-quality results. They steer the model away from common failure modes more effectively than positive prompts alone.

  3. Image-to-image strength is non-linear: The visual change between 0.4 and 0.6 is much larger than between 0.2 and 0.4, because the model needs a critical amount of noise to deviate from the source.

  4. ControlNet excels at structure preservation: For tasks like architectural rendering or pose-guided generation, ControlNet provides a level of spatial control that text prompts alone cannot achieve.

  5. Scheduler choice matters for speed: PNDM and DPM-Solver++ produce comparable quality at 20-30 steps, while the Euler ancestral scheduler may need 30-50 steps for equivalent quality.

  6. Seed reproducibility: Fixed seeds enable systematic comparison of parameters, but different schedulers produce different images even with the same seed.
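
A short sketch illustrating lessons 5 and 6, reusing the pipe from Step 1: the scheduler is swapped in place and the seed is fixed so that only the scheduler (and its step count) changes between runs. The step counts below are illustrative.

```python
# Scheduler swap with a fixed seed (lessons 5 and 6 sketch).
import torch
from diffusers import DPMSolverMultistepScheduler, EulerAncestralDiscreteScheduler

prompt = "a photograph of a castle on a cliff overlooking the ocean at sunset"
schedulers = {
    "dpm_solver_pp": (DPMSolverMultistepScheduler, 25),
    "euler_ancestral": (EulerAncestralDiscreteScheduler, 40),
}

for name, (scheduler_cls, steps) in schedulers.items():
    pipe.scheduler = scheduler_cls.from_config(pipe.scheduler.config)
    # Same seed for both runs; the images will still differ across schedulers.
    generator = torch.Generator(device="cuda").manual_seed(123)
    image = pipe(prompt, num_inference_steps=steps, generator=generator).images[0]
    image.save(f"scheduler_{name}.png")
```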

Ethical Considerations

  • Always respect copyright when using reference images for img2img or ControlNet
  • Disclose AI-generated content when sharing or publishing
  • Be mindful of generating images of real people without consent
  • Commercial use of Stable Diffusion outputs should comply with the model's license terms

Code Reference

The complete implementation is available in code/case-study-code.py.