Chapter 18: Image Generation — Midjourney, DALL·E, and Stable Diffusion

Alex runs marketing campaigns for a mid-size consumer brand. When Midjourney became available, she spent an evening with it and then started arriving at client pitch meetings with mood boards she had built in ninety minutes rather than three days. The quality of those boards — the speed with which she could show "what if it looked like this?" — changed the dynamic of early-stage creative conversations.

Elena is a management consultant. She does not think of herself as a visual professional, but she spends significant time on presentation design because client-facing decks are a major deliverable. She started using DALL·E 3 through ChatGPT to generate custom graphics for presentations and stopped apologizing for slide design in client reviews.

Both of them developed very specific views about when AI image tools are useful and when they are not. Neither view was the obvious one you might expect. This chapter builds the understanding that lets you develop your own calibrated view.

How Image Generation Works: The Conceptual Model

You do not need to understand the mathematics of diffusion models to use them effectively. But a conceptual model helps you understand why prompting works the way it does and what the failure modes mean.

AI image generation uses a process called diffusion. During training, the model is shown millions of images paired with text descriptions and learns to "de-noise" them — to take a blurry, noisy version of an image and reconstruct a cleaner version. Over training on enormous datasets, the model builds an extremely rich understanding of what visual concepts look like and how text descriptions relate to visual content.

When you provide a prompt, the model starts with random visual noise and iteratively "de-noises" it, guided by your text, toward an image that matches the prompt. Each denoising step refines the image, guided by the model's learned understanding of what your description should look like.
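The loop is easier to picture with a toy numerical sketch. This is illustrative only — a real diffusion model works on millions of pixels with a learned neural denoiser, not the target itself — but it captures the shape of the process: start from seeded random noise, repeatedly nudge toward what the prompt describes.

```python
import random

def toy_denoise(target, steps=50, seed=0):
    """Illustrative sketch: refine random noise toward a target vector,
    the way each diffusion step refines a noisy image toward the prompt.
    (Real models use a learned neural denoiser, not the target itself.)"""
    rng = random.Random(seed)
    x = [rng.gauss(0, 1) for _ in target]          # start from pure noise
    for t in range(steps):
        remaining = 1 - (t + 1) / steps            # noise budget shrinks each step
        x = [xi + 0.2 * (ti - xi)                  # nudge toward the "prompt"
             + remaining * rng.gauss(0, 0.05)      # residual randomness
             for xi, ti in zip(x, target)]
    return x

def dist(a, b):
    return sum((p - q) ** 2 for p, q in zip(a, b)) ** 0.5

target = [0.1, 0.5, 0.9, 0.3]  # stand-in for "the image the text describes"
print(dist(toy_denoise(target, seed=1), target))   # small residual error
print(dist(toy_denoise(target, seed=2), target))   # different seed, different path
```

The `seed` parameter also previews the stochasticity discussed below: the same seed reproduces the same path, while a different seed lands on a different (but still target-shaped) result.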

This conceptual model explains several important things:

Why prompting works: The model learned from text-image pairs. Text that closely resembles how real images were described in training data produces more reliable results. This is why photography and art terminology — "f/1.8 bokeh," "chiaroscuro," "cinematic lighting" — often produces better results than plain language descriptions. The model has seen those terms paired with specific visual qualities countless times.

Why generation is stochastic: Each generation run starts from different random noise. The same prompt produces different images each time. Some runs will be better than others by luck of the starting noise. This is why you generate multiple versions and select.

Why some concepts are harder than others: The model learned from available training data. Concepts that appear frequently in diverse training images (common scenes, well-photographed subjects, established art styles) generate more reliably than concepts that are rare, culturally specific, or visually ambiguous.

Why hands are famously difficult: Hands appear in enormous variety in training data — at different angles, with different numbers of fingers visible, partially obscured, in motion. The model learned a complex, ambiguous distribution of "what hands look like." When generating, it produces something statistically consistent with training data, but that statistical average often produces hands with the wrong number of fingers or impossible anatomical configurations.

💡 Intuition: Prompting as Style Description

The most useful mental model for image prompting is that you are describing an image that already exists, not instructing an artist. Describe the style, medium, lighting, mood, and composition as if you were explaining a photograph or painting you can see. "A photograph of a woman reading, shot with a 50mm lens, soft window light from the left, shallow depth of field, muted warm tones" describes visual qualities precisely. "A nice picture of someone reading" does not.

The Three Major Platforms

The AI image generation landscape has settled around three major platforms with distinctly different character, capabilities, and use cases.

Midjourney: Aesthetic Quality First

Midjourney is widely considered the current leader in aesthetic output quality. Images generated with Midjourney have a distinctive visual quality — high production value, sophisticated compositional sense, beautiful rendering of materials and light — that is immediately recognizable and consistently impressive.

Interface: Midjourney runs primarily through Discord. You use the /imagine command in a designated channel or bot DM, the bot generates four variations, and you select and refine from there. This workflow is unusual for software in 2026 but users report that it becomes natural quickly. A web interface at midjourney.com offers additional features for browsing and editing.

Strengths: Aesthetic output quality, handling of complex artistic styles, rendering of materials and textures, photorealistic imagery, cinematic composition. The v6.1 model (current as of writing) produces stunning outputs for many prompt types with relatively simple prompting.

Weaknesses: Limited text rendering in images, limited ability to specify precise spatial arrangements, inconsistency across a series of images (getting the same character or brand element to look the same across multiple generations is difficult without advanced techniques), no free tier.

Who it is for: Creative professionals, designers, marketers, anyone who prioritizes output quality and is willing to learn the platform's prompting conventions.

DALL·E 3: Convenience and Prompt Adherence

DALL·E 3 (OpenAI's third-generation image model) is integrated directly into ChatGPT, making it the most accessible of the three major platforms. You describe what you want in natural language, and ChatGPT handles the technical prompt construction — you do not need to learn special syntax.

Interface: Available through ChatGPT (with a Plus or higher subscription). You describe what you want in conversation, and ChatGPT generates images. You can also refine through conversation: "make it warmer," "add more depth to the background," "can you show this from a different angle?"

Strengths: Ease of use, natural language prompting, tight integration with ChatGPT for iterative refinement, better text rendering in images than competitors, strong adherence to complex text descriptions, ability to request variations on the same concept naturally in conversation.

Weaknesses: Output quality, while good, is generally considered below Midjourney at the top end. Less control over fine-grained aesthetic choices. Usage limits within ChatGPT.

Who it is for: General business users, people who need AI images occasionally rather than as a core workflow, anyone who wants images without learning a new system, users who need text within images.

Stable Diffusion: Open Source and Maximum Control

Stable Diffusion is an open-source image generation model that can run locally on a sufficiently powerful GPU, with no per-image cost. It is the foundation for an enormous ecosystem of fine-tuned models, plugins, and user interfaces.

Interface: Stable Diffusion is not a product — it is a model. User interfaces include Automatic1111 (AUTOMATIC1111/stable-diffusion-webui, the most feature-rich), ComfyUI (node-based workflow interface favored by power users), and Forge (a faster fork of Automatic1111). Cloud-hosted versions are available through various services for those without a suitable local GPU.

Strengths: Free to run locally, maximum control over generation parameters, enormous ecosystem of fine-tuned models for specific styles and domains, ControlNet for precise composition control, inpainting and outpainting capabilities, no content policies beyond your own choices, community-driven innovation moves faster than commercial alternatives.

Weaknesses: Requires technical setup, significant learning curve, requires a capable GPU for good performance (8GB VRAM minimum for current models, 12GB+ recommended), output quality varies significantly depending on which model you use and how you configure the generation.

Who it is for: Technical users, developers, power users who want maximum control, anyone with privacy requirements that preclude sending images to cloud services, users who generate high volumes and cannot justify per-image pricing.

⚠️ Common Pitfall: Platform Envy

It is tempting to want the "best" platform and to feel that choosing a simpler option means settling. Resist this. DALL·E 3 through ChatGPT is the right tool if you need images occasionally and do not want to invest in learning Midjourney's workflow. Midjourney is the right tool if you generate images regularly and want consistently high aesthetic quality. Stable Diffusion is the right tool if you have technical aptitude and want maximum control or volume. Match the tool to your actual use case.

Image Prompting Fundamentals

Before going platform-specific, there are prompt elements that apply across all three platforms.

The Core Prompt Elements

Effective image prompts typically describe some combination of these elements:

Subject: What is the main subject of the image? Be specific. "A woman in a business meeting" is less specific than "A woman in her 40s presenting at a whiteboard to three colleagues in a modern conference room."

Style and Medium: What visual style should the image have? Photography? Oil painting? Watercolor? Digital illustration? Isometric 3D? Comic book? Specify the medium and any relevant style references.

Lighting: Lighting is one of the most powerful composition descriptors. "Soft natural window light," "dramatic side lighting," "golden hour sunlight," "overcast diffused light," "neon-lit at night" — each evokes a specific visual quality.

Composition: Camera angle, framing, depth of field. "Close-up portrait," "wide establishing shot," "bird's eye view," "rule of thirds composition," "shallow depth of field with blurred background."

Mood and Atmosphere: "Warm and inviting," "cold and clinical," "dreamy and ethereal," "tense and dramatic." These create the emotional register of the image.

Color Palette: Specific color direction. "Muted earth tones," "vibrant primary colors," "monochromatic blue palette," "warm golden tones."

Technical Photography Terms (for photorealistic images): Lens type (35mm, 50mm, 85mm portrait lens), aperture (f/1.8 for shallow depth of field, f/8 for sharp overall), camera model references (Hasselblad for medium-format look), film stock references (Kodak Portra 400 for warm film aesthetic).
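These elements compose naturally into a comma-separated description. A minimal helper makes the structure explicit — the element names and their ordering here are this chapter's convention, not any platform's requirement:

```python
def build_prompt(subject, style=None, lighting=None, composition=None,
                 mood=None, palette=None, technical=None):
    """Join the core prompt elements in order, skipping any left unspecified."""
    elements = [subject, style, lighting, composition, mood, palette, technical]
    return ", ".join(e for e in elements if e)

print(build_prompt(
    subject="a woman in her 40s presenting at a whiteboard to three colleagues",
    style="candid documentary photography",
    lighting="soft natural window light",
    composition="shallow depth of field",
    mood="warm professional atmosphere",
))
```

Keeping the subject first matches the general rule (discussed below for Midjourney) that earlier prompt elements carry slightly more weight.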

Negative Prompts

Most platforms support negative prompts — descriptions of what you do not want in the image. These are powerful for removing common failure modes:

--no hands (Midjourney) or adding to a negative prompt field (Stable Diffusion) can reduce the frequency of deformed hands in images featuring people.

Common negative prompt content: blurry, out of focus, low quality, watermark, text, signature, extra limbs, deformed hands, ugly, distorted
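The same negatives are expressed differently per platform: Midjourney appends a `--no` parameter to the prompt, while Stable Diffusion interfaces take a separate negative-prompt field. A small hypothetical helper sketches the difference:

```python
COMMON_NEGATIVES = ["blurry", "low quality", "watermark", "text",
                    "extra limbs", "deformed hands"]

def with_negatives(prompt, negatives, platform):
    """Attach negative terms in the form each platform expects (sketch)."""
    if platform == "midjourney":
        # Midjourney: one --no parameter with comma-separated terms
        return f"{prompt} --no {', '.join(negatives)}"
    if platform == "stable-diffusion":
        # SD interfaces: prompt and negative prompt are separate fields
        return {"prompt": prompt, "negative_prompt": ", ".join(negatives)}
    raise ValueError(f"unknown platform: {platform}")

print(with_negatives("portrait of a violinist", ["deformed hands"], "midjourney"))
# → portrait of a violinist --no deformed hands
```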

Reference Images

All three platforms support using existing images as reference:

  • Style reference: "I want the style of this image applied to my prompt"
  • Subject reference: "I want this person/object to appear in a new image"
  • Composition reference: "I want a composition similar to this"

Using reference images dramatically improves consistency and helps the model understand what you want better than text alone can convey.

Aspect Ratios

Images generated in a 1:1 square format (the default) often look different from the same prompt in a landscape (16:9) or portrait (2:3) format. The composition adapts to the canvas. For specific use cases — social media, presentation slides, phone wallpapers, print products — specify the aspect ratio from the start.

Best Practice: Prompt Iteration, Not Prompt Perfection

Do not try to write the perfect prompt on the first attempt. Generate four images with a decent prompt, identify what is wrong with the best result, and refine from there. The iteration cost is low (a few seconds), and seeing actual outputs tells you far more than trying to predict what a prompt will generate. Professional users iterate 5-20 times for important images.

Midjourney Mastery

Midjourney's power users develop fluency with its parameter system. Understanding these parameters is the difference between generic results and precisely controlled output.

Prompt Structure

Midjourney prompts work best in this structure:

/imagine [subject description], [style/medium], [lighting], [composition], [mood], [technical parameters]

The order matters somewhat — elements at the beginning of the prompt receive slightly more weight. Important concepts belong earlier.

Example progression:

Basic: /imagine marketing team meeting in modern office

Better: /imagine diverse marketing team collaboration meeting, modern minimalist office, large windows, natural light, candid documentary photography style, warm professional atmosphere

Strong: /imagine diverse marketing team of five people collaborating around a glass conference table, modern minimalist office with large floor-to-ceiling windows, late afternoon golden light, candid documentary photography, shot on Sony A7IV with 35mm lens, warm professional atmosphere, editorial style --ar 16:9 --v 6.1

Key Midjourney Parameters

--ar [width:height] — Aspect ratio. --ar 16:9 for landscape video/presentation format, --ar 2:3 for portrait/phone, --ar 1:1 for square social media.

--v [version] — Model version. --v 6.1 (current) offers the highest-quality photorealism; some users still prefer --v 5 for stylized artistic output.

--style [value] — Applies aesthetic variation. --style raw turns off Midjourney's automatic aesthetic enhancement, giving you more direct control but requiring more complete prompting.

--chaos [0-100] — Controls variation between generated images. Low chaos (0-10) produces four similar variations. High chaos (50-100) produces four very different interpretations. Use high chaos early in exploration, low chaos when you have found a direction you want to refine.

--weird [0-3000] — Introduces unusual, experimental aesthetic qualities. Useful for finding unexpected creative directions. Most useful at moderate values (250-500).

--quality [.25, .5, 1] or --q — Controls generation quality and time. --q .5 is faster and cheaper; --q 1 uses full quality. For exploration, .5 is fine. For final outputs, use 1.

--no [concept] — Negative prompt. --no hands significantly reduces deformed hand frequency in images with people.

--seed [number] — Reproducibility. Using the same seed with the same prompt produces the same image. Useful for consistency across variations.
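Teams that generate at volume often template these flags rather than retype them. A hypothetical helper — the flag names match the parameters above, the defaults are illustrative — assembles a complete prompt string:

```python
def mj_prompt(description, ar="16:9", version="6.1", chaos=None,
              seed=None, style=None, no=None):
    """Assemble a /imagine prompt string with Midjourney-style flags (sketch)."""
    parts = [description, f"--ar {ar}", f"--v {version}"]
    if chaos is not None:
        parts.append(f"--chaos {chaos}")
    if seed is not None:
        parts.append(f"--seed {seed}")
    if style:
        parts.append(f"--style {style}")
    if no:
        parts.append(f"--no {', '.join(no)}")
    return " ".join(parts)

print(mj_prompt("marketing team at a glass table, golden hour light",
                chaos=20, no=["hands"]))
# → marketing team at a glass table, golden hour light --ar 16:9 --v 6.1 --chaos 20 --no hands
```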

The /describe Command

The /describe command takes an image you upload and produces four prompt suggestions that might have generated it. This is invaluable for:

  • Reverse-engineering the style of reference images you like
  • Learning prompt vocabulary for visual effects you want to achieve
  • Understanding how Midjourney interprets visual concepts

Upscaling and Variations Workflow

After generating four initial images, the workflow options are:

U1-U4: Upscale one of the four images to higher resolution. The current upscaler also adds subtle detail.

V1-V4: Generate four new variations based on one of the four images, keeping the general composition but varying details.

Re-roll (🔄): Generate four entirely new images with the same prompt.

Vary (Region): Select a region of the image and regenerate only that area — Midjourney's inpainting feature. Useful for fixing specific problems (a strange hand, an odd face) without regenerating the full image.

The professional workflow: generate → identify the strongest of four → upscale or vary → refine further → upscale final.

Reference Images in Midjourney

--sref [URL] — Style reference. Applies the style of the referenced image to your prompt.

--cref [URL] — Character reference. Attempts to maintain the appearance of a character from the reference image in the new generation. Useful for creating consistent characters across multiple images.

You can use multiple reference images simultaneously: --sref [URL1] [URL2] mixes the styles. The --sw [0-1000] and --cw [0-100] parameters control how strongly the reference influences the output.

🎭 Scenario Walkthrough: Alex's Mood Board Session

Alex has a brief for a health food brand campaign: natural, vibrant, real people (not stock photo perfect), outdoors when possible, authentic rather than aspirational.

She starts with high chaos to explore directions: /imagine healthy lifestyle food photography, fresh vegetables, natural outdoor light, real diverse people, authentic documentary style --ar 16:9 --chaos 70

From four outputs, two feel too polished (too much like stock photos), one is interesting but compositionally off, one has the right energy — outdoor, candid, real-feeling. She notes the visual qualities of the good one and uses /describe on it to understand how Midjourney is reading its qualities.

She refines: /imagine sun-weathered hands holding fresh vegetables at outdoor farmers market, golden morning light, shot on Fujifilm X-T5 with 23mm lens, candid documentary style, warm earth tones, authentic moment not posed --ar 3:2 --chaos 20

Three iterations later, she has twelve images she is genuinely happy with. Total time: eighty minutes. She could have spent three days on this with a traditional stock photo workflow.

DALL·E 3 in ChatGPT: The Conversational Approach

DALL·E 3's integration with ChatGPT changes the prompting model fundamentally. You are not writing a technical prompt — you are having a conversation with an assistant that translates your intent into an effective prompt behind the scenes.

Natural Language Prompting

The primary advantage of DALL·E 3 through ChatGPT is that you can describe what you want the way you would explain it to a designer:

"I need an image for a consulting firm presentation about digital transformation. Something that shows the bridge between traditional business and digital technology — not clichéd with gears and circuit boards, something more thoughtful and human."

ChatGPT processes this, constructs a more technical prompt, and generates the image. If the first result does not capture the spirit, you say so conversationally:

"The first one is too abstract. I want to see actual people — maybe showing a meeting where someone is presenting data on a screen to business colleagues."

Using ChatGPT to Refine Image Prompts

A powerful DALL·E workflow: before generating, ask ChatGPT to develop your image concept into a rich prompt, then review and adjust before generating.

"Help me develop an image prompt for: a scene showing AI tools helping a small business owner, for use in a business magazine article. I want it to feel positive and empowering, not dystopian or robotic. Let's talk through what the image should contain before I generate it."

This upfront conversation often produces better first generations than direct prompting.
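For users who eventually automate this, the same generation can go through OpenAI's images API rather than the chat interface. The sketch below only constructs the request payload — actually sending it requires the `openai` package and an API key, and the field names, which reflect the DALL·E 3 API as commonly documented, should be verified against OpenAI's current documentation:

```python
def dalle_request(prompt, size="1024x1024", quality="standard", n=1):
    """Build a payload for an image-generation request (field names per
    OpenAI's DALL·E 3 API as of this writing -- verify against current docs)."""
    # DALL·E 3 supports square plus landscape/portrait sizes
    assert size in {"1024x1024", "1792x1024", "1024x1792"}
    return {"model": "dall-e-3", "prompt": prompt,
            "size": size, "quality": quality, "n": n}

payload = dalle_request(
    "a thoughtful illustration of a team bridging traditional business "
    "and digital technology, warm editorial style",
    size="1792x1024",
)
print(payload["model"], payload["size"])
```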

Iterative Refinement in ChatGPT

DALL·E 3 in ChatGPT supports iterative refinement through conversation. You can say:

  • "Can you make this warmer?"
  • "Show the same scene from a wider angle"
  • "Remove the person on the left"
  • "Make it feel more evening light, less midday"

Each request generates a new image with your modification applied. The conversational history provides context for each refinement.

Text in Images

DALL·E 3 is significantly better at rendering readable text within images than other models. For images that need to include words — signage, labels, book titles, presentation slide mockups with placeholder text — DALL·E 3 is the better choice.

Inpainting

DALL·E 3 supports inpainting through the canvas editor: upload an image, paint over the region you want to change, describe what should replace it. This is useful for:

  • Fixing specific elements in otherwise good images
  • Adding or removing objects from scenes
  • Adapting existing images for new purposes

⚠️ Common Pitfall: Expecting Precise Spatial Control

DALL·E 3 and all diffusion models interpret composition loosely. "Put the logo in the top right corner" will not reliably place the logo in the top right corner. For precise spatial arrangements, you need Stable Diffusion with ControlNet, or post-generation editing in traditional design tools. Use AI generation for creative concept and aesthetic — use design tools for precise layout.

Stable Diffusion for Non-Technical Users

Stable Diffusion has a reputation for complexity that discourages non-technical users. With modern interfaces, the learning curve is manageable and the control it offers is unmatched.

Getting Started

The simplest path to Stable Diffusion without technical setup: cloud-hosted services like Replicate, Civitai's online generator, or Stability AI's DreamStudio provide web interfaces to Stable Diffusion models without local installation. These have per-generation costs but no setup requirements.

For users who want local installation and have a reasonably powerful NVIDIA GPU (8GB VRAM minimum), the Automatic1111 web interface is the most commonly used option. Setup involves installing Python and Git, cloning the repository, and running an installation script. Copious tutorials exist. The initial setup takes 30-60 minutes; subsequent use requires only launching a script.
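The setup itself is a short sequence of commands. A minimal sketch for Linux/macOS (Windows users launch `webui-user.bat` instead; consult the repository's README for your platform):

```shell
# Clone the Automatic1111 web UI and launch it.
# The first launch installs Python dependencies and downloads a base model,
# which accounts for most of the 30-60 minute setup time.
git clone https://github.com/AUTOMATIC1111/stable-diffusion-webui.git
cd stable-diffusion-webui
./webui.sh
```

After the first run, subsequent sessions start in seconds with the same launch script.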

Choosing Models

Stable Diffusion's ecosystem includes thousands of fine-tuned models optimized for specific aesthetics:

  • Realistic Vision and DreamShaper: general-purpose photorealism
  • SDXL (Stable Diffusion XL): higher resolution, better prompt following
  • Flux (the most recent generation, from Black Forest Labs): dramatically improved quality, better text rendering, increasingly adopted as the new standard
  • Anime, illustration, architectural visualization, fashion, and countless other specialized models

Selecting the right base model is the most impactful choice in the Stable Diffusion workflow.

ControlNet: The Distinguishing Feature

ControlNet is what separates Stable Diffusion from the competition for users who need precise compositional control. ControlNet lets you provide a control image that constrains the composition:

  • Pose control: Extract a human skeleton from a reference image and generate a new image with the same body position
  • Edge/line control: Provide a line drawing or edge-detected image as a compositional template
  • Depth control: Use a depth map to control spatial arrangement
  • Scribble control: Use rough hand-drawn sketches to guide composition

This is the capability that makes Stable Diffusion the choice for users who need to show a specific composition, pose, or layout that they cannot achieve through text prompting alone.

Use Cases by Professional Role

Alex: Marketing and Campaign Ideation

High-value uses:

  • Mood boards for client presentations: Midjourney for aesthetic quality, a 60-90 minute workflow from brief to board
  • Concept exploration: generating five different visual directions for a campaign in an afternoon
  • Social media content: product lifestyle imagery, abstract visual concepts for brand channels
  • Ad concept visualization: rough visual approximations of campaign concepts before briefing a photographer or creative director

Lower-value or inappropriate uses:

  • Final ad creative that will run at scale (requires rights clarity, brand consistency, quality control)
  • Replacing photographer shoots for hero campaign images (consistency, licensing, model releases)
  • Images of specific real people or celebrities (significant legal and ethical concerns)

Alex's key workflow: Midjourney for exploration and mood boards, DALL·E 3 for quick one-off visuals that need to match a specific brief description, Photoshop or Canva for combining and finalizing AI-generated elements with designed layouts.

Elena: Consulting and Business Communication

High-value uses:

  • Custom presentation graphics: conceptual illustrations for complex ideas (digital transformation, organizational change, strategic frameworks)
  • Deck cover images: professional, relevant, not generic stock photo
  • Report section headers: consistent visual language across a long document
  • Workshop materials: visual aids that look custom rather than Shutterstock

Lower-value or inappropriate uses:

  • Data visualizations (use proper charting tools)
  • Images that will be passed off as photographs of actual events
  • Any image depicting real, identifiable clients or their workplaces

Elena's key workflow: DALL·E 3 through ChatGPT for its ease of use and natural language iteration, focused on conceptual illustration rather than photorealism. She keeps prompts saved for image types she returns to frequently.

Advanced Midjourney Techniques

Once you have the fundamentals working in Midjourney, several advanced techniques open up significantly more control over output.

Blending Multiple Images

The /blend command combines 2-5 images into a new composite, inheriting visual elements from each source. This is different from using reference images in a text prompt — blend combines the images themselves, not just their described qualities. Useful for:

  • Creating color palette mashups (blend a brand color reference with a subject image to apply the palette)
  • Combining stylistic elements from multiple references
  • Creating hybrid concepts that are hard to describe in text

Pan and Zoom

Midjourney's pan and zoom features extend generated images beyond their initial frames. An upscaled image can be panned left, right, up, or down, generating new content that continues the scene. Zoom Out generates a new image that shows your original as a smaller element within a larger scene.

These features enable creating wider scenes than the initial prompt intended, showing context around a subject, and recovering from images where the subject is slightly too close.

Inpainting for Targeted Fixes

Midjourney's "Vary (Region)" inpainting lets you repaint specific regions of an upscaled image while keeping the rest intact. Workflow:

  1. Upscale the best of your four generations
  2. Click "Vary (Region)"
  3. Paint over the area you want to change
  4. Describe what should appear instead (or leave the original prompt)
  5. Generate three new variations of just that region

This is particularly useful for fixing specific problems — a strange hand, an inconsistent background element, a face that did not generate well — without regenerating the entire image from scratch.

The Midjourney Prompting Style Database

Regular Midjourney users develop a personal reference library of prompt elements that reliably produce specific results. Building this library deliberately accelerates skill development:

Lighting vocabulary: "Rembrandt lighting," "butterfly lighting," "split lighting," "golden hour sunlight," "overhead fluorescent," "neon ambient light" — each produces distinctive, recognizable lighting results.

Photographic vocabulary: "DSLR photo," "medium format," "shot on iPhone," "daguerreotype," "wet plate collodion," "photographic film grain" — each shifts the image toward different photographic aesthetics.

Artistic medium vocabulary: "gouache painting," "watercolor sketch," "vector illustration," "linocut print," "etching," "oil on canvas" — each produces different textural and rendering qualities.

Composition vocabulary: "close-up portrait," "environmental portrait," "detail shot," "aerial view," "worm's eye view," "symmetrical composition," "negative space composition" — each shifts spatial relationships.

Investing time in building this vocabulary through experimentation produces lasting returns. An hour of systematic exploration — running the same subject with ten different lighting descriptors — teaches you what those terms actually produce in ways that reading descriptions cannot.

Post-Processing AI-Generated Images

Raw AI-generated images often benefit from post-processing before professional use. Understanding what post-processing typically adds helps you decide when AI generation is "done" and when it needs additional work.

Color Grading

AI-generated images sometimes have color characteristics that do not match your brand palette or the tone you want. Basic color grading in Photoshop, Lightroom, or even Canva can:

  • Pull the palette toward brand colors
  • Increase or decrease warmth
  • Adjust contrast for the intended display environment
  • Unify a set of images that were generated separately and have slightly different tones

For regular users generating images for consistent brand contexts, developing a color grade preset that you apply to all AI-generated images creates visual consistency that individual generation cannot reliably maintain.
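A color grade is, at bottom, a per-pixel transform applied uniformly. A toy pure-Python version — real workflows would use a Lightroom preset or an image library, not this — shows why one shared preset unifies separately generated images:

```python
def apply_grade(pixels, warmth=0.0, contrast=1.0):
    """Grade a list of (r, g, b) tuples (0-255): warmth shift, then contrast."""
    graded = []
    for r, g, b in pixels:
        r += warmth * 20                     # warmth: push red up...
        b -= warmth * 20                     # ...and blue down
        channels = []
        for c in (r, g, b):
            c = (c - 128) * contrast + 128   # contrast pivots around mid-gray
            channels.append(max(0, min(255, round(c))))
        graded.append(tuple(channels))
    return graded

# The same preset applied to two "different generations" nudges both warmer.
print(apply_grade([(100, 100, 100), (180, 150, 120)], warmth=1.0, contrast=1.1))
# → [(119, 97, 75), (207, 152, 97)]
```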

Compositing AI Elements

For presentations and marketing materials, AI-generated images are often more useful as elements within a designed layout than as standalone images. Common workflows:

  • AI-generated concept image as a section background, with text overlaid
  • AI-generated illustration combined with charts or data visualization
  • AI-generated product lifestyle image with the actual product composited in afterward (particularly for new products that cannot be photographed yet)

This compositing approach also sidesteps some of the brand consistency challenges — the AI generates the visual context, traditional design tools handle the brand-specific elements.

Sharpening and Resolution Enhancement

Midjourney v6.1 and DALL·E 3 generate at relatively high resolution (typically 1024x1024 or higher), but for large-format print use, additional upscaling may be needed. AI-powered upscalers like Topaz Gigapixel AI or the generative upscaling in Adobe Photoshop can add resolution while maintaining or improving image quality. For most digital use cases (presentations, web, social media), native generation resolution is sufficient.

Handling the Hands Problem in Post-Processing

When AI generates an image with prominently visible hands that have the classic wrong-finger-count or impossible-anatomy problem, and you have an otherwise excellent image, inpainting (either Midjourney's Vary Region or Photoshop's generative fill) can fix the hands without regenerating from scratch.

This workflow: upscale the image in Midjourney → use Vary (Region) with a painted mask over the hands → prompt for "realistic hands, correct anatomy, five fingers" → generate three attempts → select the best result and merge.

AI Image Generation for Specific Business Contexts

Beyond the general use cases covered earlier, several specific business contexts benefit from more detailed guidance.

E-Commerce Product Imagery

Stable Diffusion with appropriate models can generate product imagery in a studio context — product placed on a clean background with professional lighting — at a fraction of the cost of a product photography shoot. For high-volume e-commerce with many SKUs, this can be economically significant.

The workflow: photograph the actual product on a simple background, use Stable Diffusion's inpainting or SDXL's background removal + generation to place it in various studio or lifestyle contexts. The product itself is real; the context is AI-generated.

Trust calibration note: for high-value products or brand-sensitive categories, AI-generated product imagery may not yet meet the quality bar for main hero images. For secondary lifestyle images, additional variants, or lower-budget product lines, the quality bar is often met.

Social Media Content at Scale

For organizations that maintain high-volume social media presences, AI image generation changes the economics of visual content creation. At 20-30 posts per week, each needing a custom image, even DALL·E 3 (at roughly 2-3 minutes per image including iteration) represents a significant efficiency gain versus stock photo search or commissioning design work.

Practical workflow for social media teams:

1. Develop 5-10 prompt templates for recurring content categories (product features, brand moments, campaign posts, educational content)
2. Adapt each template for specific posts with 1-2 sentences of content-specific direction
3. Generate, iterate once or twice if needed, apply color grade preset
4. Review and select

The templates plus color grade preset create brand visual consistency without requiring the same level of craft for each individual image.
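A few lines of Python make the template idea concrete. This is a minimal sketch; the template names, fields, and brand style string are illustrative assumptions, not features of any platform:

```python
# Minimal sketch of a prompt-template system for recurring social
# content categories. Template names, fields, and the brand style
# constant are illustrative assumptions.
BRAND_STYLE = "editorial photography, warm natural light, muted earth tones"

TEMPLATES = {
    "product_feature": "{product} in use, close-up detail shot, {style} --ar 1:1",
    "educational": "conceptual illustration of {concept}, flat design, {style} --ar 16:9",
}

def build_prompt(category: str, **fields: str) -> str:
    """Fill a category template with post-specific details plus the brand style."""
    return TEMPLATES[category].format(style=BRAND_STYLE, **fields)

print(build_prompt("product_feature", product="stainless travel mug"))
```

Because the same style string appears in every generated prompt, the resulting feed stays visually consistent without per-image craft.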

Internal Communication and Training Materials

The most underexplored high-value use case for many organizations: custom imagery for internal documents, training materials, process documentation, and internal presentations. The quality bar is lower than external-facing materials, the rights considerations are simpler (internal use), and the volume can be high.

Most employees who create internal materials default to clip art, stock photos, or no imagery because the friction of custom image creation is too high. AI generation dramatically lowers that friction. Training decks, internal reports, policy documentation, and onboarding materials all become visually richer with minimal additional effort.

Rapid Concept Visualization for Product Teams

Product and UX teams routinely need to visualize concepts before they exist — to communicate to stakeholders what a future product or feature might look like, or to explore design directions before investing in high-fidelity prototyping. AI image generation is not a substitute for product design tools (Figma, Sketch) but is a useful step earlier in the process.

A product manager describing a new feature to an engineering team can generate a rough visual representation of the concept in ten minutes and communicate much more precisely than a verbal description alone. This is not UI design — it is visual concept communication.

Rights, Copyright, and Commercial Use

Image rights and ownership are covered in depth in Chapter 34 (AI, Copyright, and Intellectual Property). The essential practical points for now:

AI-generated images and copyright: The legal landscape regarding AI-generated images and copyright is unsettled as of early 2026. In most major jurisdictions, pure AI-generated images without significant human creative contribution are not clearly copyrightable by the person who prompted them. This matters for commercial use.

Platform terms and commercial rights: Each platform's terms of service govern commercial use of generated images. Midjourney grants paid subscribers rights to use generated images commercially. DALL·E 3 through ChatGPT grants usage rights per OpenAI's terms. Read the current terms for any platform you use commercially — they change.

Training data and style claims: Courts in multiple jurisdictions are considering whether models trained on copyrighted images infringe those copyrights. No definitive rulings exist as of writing. Consult legal counsel for high-stakes commercial applications.

Practical guidance: For internal business use (presentations, internal communications, mood boards), the risk profile is low. For published commercial use (advertising, products, books), consult current legal guidance and your organization's policies. For high-budget campaigns, use AI for concepting and traditionally produced images for final deliverables.

⚠️ Common Pitfall: Assuming Commercial Freedom
"I made it" does not equal "I own it" for AI-generated images, and "I own it" does not equal "I have unlimited commercial rights." Understand the specific rights granted by each platform you use, and apply additional scrutiny when using AI images in commercial contexts that could expose your organization to legal risk.

Midjourney Discord Workflow in Depth

New users often find Midjourney's Discord-based workflow the most disorienting aspect of the platform. This section walks through it in enough detail to eliminate the friction.

Account Setup and Server Navigation

After creating a Midjourney account and subscribing, you join the Midjourney Discord server. You can generate images in the public channels (where your images are visible to other users) or in a direct message with the Midjourney bot (private, but with the same generation capabilities). Most professional users quickly move to DM with the bot to keep their work private.

In DM with the bot, type /imagine and the command interface appears. Your prompt goes in the prompt field after the slash command.

Reading the Generation Interface

When you submit a prompt, Midjourney shows a progress bar as the image generates (typically 20-60 seconds depending on parameters). When complete, you see four images in a 2x2 grid, numbered left-to-right, top-to-bottom: 1, 2, 3, 4.

Beneath the image grid: buttons labeled U1-U4 (Upscale) and V1-V4 (Variations), plus a refresh button. Additional options appear after upscaling: Vary (Subtle), Vary (Strong), Vary (Region), and the zoom/pan options.

Managing Your Generations

High-volume Midjourney users quickly discover that Discord is not a great place to manage and find generated images. The midjourney.com web interface provides a gallery view of all your generations, searchable and filterable. This is where most professionals go to find specific past generations.

Downloading images: right-click on an upscaled image in Discord and save, or use the web gallery's download feature. Upscaled images are typically 1024x1024 or larger depending on the version and upscaling option used.

Fast Mode vs. Relax Mode

Midjourney subscriptions include a certain number of "fast mode" GPU hours per month. When you exhaust those, you are in "relax mode," where images take longer to generate (often 5-10 minutes versus 20-60 seconds). Heavy users on lower-tier plans encounter relax mode frequently during periods of intensive use. To check your remaining fast hours, use /info.

For time-sensitive work (client presentations, meeting prep), keep an eye on your remaining fast hours and consider whether to save them or pay for additional GPU time.

DALL·E 3 Advanced Techniques

Beyond the basics, several DALL·E 3 techniques improve results for professional use.

The Prompt Expansion Request

Before generating, ask ChatGPT to develop your brief description into a more complete image prompt. This leverages ChatGPT's understanding of what makes effective image prompts:

"Before generating, write out a detailed image generation prompt for: [your brief description]. Show me the prompt text, then use it to generate."

Reading ChatGPT's expanded prompt teaches you what additional elements it considers important and often reveals quality improvements you can adopt in your own direct prompting.

Consistent Visual Threads Across a Conversation

Within a single ChatGPT conversation, DALL·E 3 has some ability to maintain visual consistency with previous images. Rather than starting fresh each time, continue in the same conversation and refer to "the same style as the image above" or "make it consistent with the previous image"; this produces more visually coherent results than generating each image in a new conversation.

This is not perfect — the consistency is partial, not guaranteed — but for creating 3-4 images for the same presentation section, it provides more cohesion than generating each independently.

Combining Image Analysis and Generation

ChatGPT's ability to analyze uploaded images and then generate new images creates a useful workflow for style matching. Upload a reference image (a photograph whose lighting you want to match, a published illustration whose style you admire), ask ChatGPT to analyze what makes the visual work, then ask it to generate a new image in that style.

"Analyze this image: what creates its visual mood? What lighting, color palette, and compositional choices are most distinctive? Now generate an image of [my subject] using those same qualities."

For Presentation-Ready Output

When generating images specifically for use in presentations:

- Specify --ar 16:9 or describe "wide landscape format" for full-bleed slide backgrounds
- Request "high key lighting" or a "clean background" for images that need to work with overlaid text
- Ask for images with the focal point off to one side so you can overlay text on the opposite side of the frame
- Specify "no text" explicitly; DALL·E 3 sometimes adds watermark-style text if not instructed otherwise

When NOT to Use AI Image Generation

AI image generation is powerful enough that it is worth explicitly naming when not to use it.

When you need precise visual accuracy: Documenting an actual space, showing real products with accurate specifications, technical diagrams with exact measurements.

When authenticity of source matters: Journalism, documentary content, social media posts where audiences expect real photographs.

When brand consistency across a series is critical: Without significant investment in advanced techniques (character references, fine-tuning), maintaining exact visual consistency across a campaign is difficult. A traditional photographer shoot is more consistent.

When depicting real people: Generating photorealistic images of real, identifiable people raises significant legal (right of publicity, defamation) and ethical concerns. This is a clear "when not to."

When the time investment is not worth it: For a quick graphic you need in five minutes and have stock photo access, use the stock photo. AI image generation has a workflow overhead that makes it inefficient for very quick, low-stakes needs.

Developing Your Prompting Eye: Learning to See What AI Produces

The gap between new and experienced AI image generation users is primarily a perceptual gap — experienced users have learned to see the qualities in AI outputs that indicate what went wrong and what to change in the prompt. Building this perceptual skill deliberately accelerates the learning curve.

Learning from Failure

Save your failed generations, not just your successful ones. Review them periodically and ask: what specific visual quality is wrong? Is the lighting direction inconsistent? Is the composition too centered? Is the color palette not what you intended? Is the mood wrong?

For each failure, try to articulate a specific change to the prompt that would address the problem. This diagnostic practice — failure → specific diagnosis → prompt change → new generation — builds the feedback loop that produces rapid skill improvement.

The Reference Image Habit

Experienced Midjourney users routinely run /describe on images they admire — not just to generate similar images, but to build vocabulary. Every /describe session expands the library of terms you understand in context. Over time, you develop a rich vocabulary of visual quality descriptors that makes prompting more precise.

Build a personal reference folder: save images (from photography books, design publications, stock photo libraries, or any visual source) that represent the visual qualities you want to invoke in your work. When starting a new generation project, review these references and use /describe to generate vocabulary.

Comparative Generation

A powerful learning practice: generate the same subject with systematically varied prompts, changing one element at a time. Generate the same portrait with ten different lighting descriptors. Generate the same scene with five different photographic style references. Generate the same illustration with different medium descriptors.

This systematic approach produces a personal reference library — visual evidence of what specific terms produce — that is more useful than any written guide because it is built from your own observations on your specific subjects.
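One-variable-at-a-time variation is easy to script. A sketch, where the subject and the list of lighting descriptors are just examples:

```python
# Sketch: vary one prompt element at a time to learn what each
# descriptor does. The subject and lighting terms are illustrative.
base = "portrait of a ceramic artist in her studio, {lighting}, 85mm lens"

lighting_terms = [
    "soft window light",
    "golden hour backlight",
    "dramatic chiaroscuro",
    "overcast diffused light",
    "warm tungsten practicals",
]

variations = [base.format(lighting=term) for term in lighting_terms]
for v in variations:
    print(v)  # submit each to the same platform and compare side by side
```

Keeping everything but the lighting term fixed means any visual difference between results can be attributed to that one descriptor.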

Common Failures and How to Handle Them

Hands and Fingers

The classic AI image tell: hands with the wrong number of fingers, anatomically impossible finger positions, melted or merged hands. All diffusion models struggle with hands because hands appear in enormously varied positions in training data and are frequently partially occluded, making consistent anatomy difficult to learn.

Mitigation: Use --no hands in Midjourney, add "deformed hands, extra fingers, malformed hands" to the negative prompt in Stable Diffusion, or use inpainting to fix hand regions specifically. Alternatively, compose images where hands are not prominently visible.

Text in Images

Text generated within images is frequently misspelled, stylistically inconsistent, or illegible, particularly with non-DALL·E models. DALL·E 3 handles text significantly better than Midjourney.

Mitigation: For non-DALL·E platforms, use text overlays in post-processing (Canva, Photoshop) rather than relying on AI to generate readable text within the image.

Consistency Across Multiple Images

Generating the "same" character, setting, or brand element across multiple images is genuinely difficult without advanced techniques. Each generation is statistically independent.

Mitigation: Midjourney's --cref character reference provides some consistency. Stable Diffusion with fine-tuned models trained on specific subjects (using techniques like LoRA or Dreambooth) provides more reliable consistency. For campaigns requiring high consistency across many images, a traditional production approach is more reliable.

Photorealism of Real People

Generating convincingly photorealistic faces is something diffusion models can do well — which creates both capability and concern. Generated faces of nonexistent people are useful. Attempting to generate photorealistic images of real, specific people is a use case with significant legal and ethical problems.

The 15-Second Look

With experience, you can recognize AI-generated images within seconds: subtle skin texture issues, anatomical improbabilities, inconsistent lighting logic across the image, background elements that dissolve into incoherence under examination. As models improve, these tells become harder to spot. The ability to distinguish AI-generated images from real photographs is a diminishing skill. Plan your communication practices accordingly — if accurate representation of reality matters, use real photographs.

Research: Creative Professional Adoption

Research on AI image generation adoption in creative professions shows a more complex picture than either "AI replaces creatives" or "professionals reject AI tools" would suggest.

Surveys of creative professionals (designers, art directors, photographers) consistently show high awareness and experimentation rates — well over half of professional creatives in major markets have tried AI image generation. Adoption into regular professional workflow is lower but growing.

Professional adoption is concentrated in specific use cases: ideation and concepting (high adoption), client presentations and mood boards (high adoption), final deliverables for major campaigns (low adoption). The pattern suggests that professionals have found genuine value in AI as a concepting and communication tool while remaining skeptical of it as a replacement for final production quality.

The most significant pushback from creative professionals is not about output quality (which is acknowledged as impressive) but about attribution, training data ethics, and the impact on commercial photography markets. These concerns are legitimate and ongoing.

For non-creative professionals (marketers, consultants, researchers) who use images instrumentally — to communicate ideas, not as art — adoption is growing rapidly with fewer reservations.

Prompt Templates for Six Common Image Types

1. Business/Professional Photography Style

[Subject description], professional editorial photography, modern [office/setting type],
natural window light, authentic candid moment, shot on [camera reference],
warm professional atmosphere, diverse professionals, --ar 16:9 --v 6.1

2. Conceptual Illustration for Business

Conceptual illustration of [business concept], flat design style, clean minimal aesthetic,
[color palette], professional business communication, editorial illustration,
white background variant, suitable for presentation slide --ar 16:9

3. Product Lifestyle Photography

[Product type] lifestyle photography, [setting description], natural light,
aspirational [consumer demographic] lifestyle, editorial product photography,
shot on [camera], [color mood], commercial photography quality --ar 3:2 --v 6.1

4. Abstract Visual Concept

Abstract visualization of [concept], [art movement/style reference],
[color palette], [mood descriptors], [medium: digital art / oil painting /
watercolor / etc.], professional contemporary aesthetic --ar 16:9

5. Campaign Mood Board Image

[Campaign theme] aesthetic mood image, [target demographic] lifestyle,
[season/time of day], [key brand values: warm/cool/energetic/calm],
editorial photography style, [color direction], authentic moment not posed --ar 2:3

6. Presentation Cover Image

Professional presentation cover image for [topic/industry], sophisticated minimal aesthetic,
[color palette consistent with brand], [metaphorical visual: bridge/horizon/network/etc.],
corporate communication design, suitable for full-bleed slide background --ar 16:9

📋 Action Checklist: Before Using AI-Generated Images Commercially

- [ ] Confirmed the platform's current terms of service permit commercial use
- [ ] Reviewed your organization's AI use policy for images
- [ ] Verified no real, identifiable people appear in the image
- [ ] Considered whether the use case requires attribution or disclosure
- [ ] Checked that image accuracy claims are appropriate (AI image ≠ real photograph)
- [ ] Determined whether legal review is warranted for high-stakes commercial use

The Evolving Landscape: What Has Changed and What Is Changing

The AI image generation landscape in 2026 looks dramatically different from 2022, and the pace of change continues. Understanding the trajectory helps you calibrate expectations.

Model Generations and Quality

Each major model update from Midjourney (v3, v4, v5, v5.2, v6, v6.1) has represented meaningful quality improvements. Photorealism, handling of complex scenes, text rendering, and consistency have all improved substantially across model generations. The same is true for DALL·E and the Stable Diffusion model family (with the Flux models from Black Forest Labs representing a significant jump in the open-source ecosystem as of 2025).

This improvement trajectory means that limitations noted in any guide — including this one — may be partially addressed by the time you read it. Hands have improved. Text rendering has improved. Consistency across images has improved. The specific weaknesses shrink with each generation, though new challenges sometimes emerge.

Video Generation

The next frontier in generative media is video. Sora (OpenAI), Runway Gen-3 Alpha, Kling, and comparable tools have made significant progress on AI video generation. The same prompting principles apply — describe visual qualities, lighting, mood, motion — but the medium adds temporal consistency requirements (subjects should look the same across frames) that are even harder to maintain than spatial consistency in images.

As of early 2026, AI video generation is not yet at the quality level for professional broadcast or advertising production, but is useful for:

- Rough concept visualization
- Social media short-form content where high production value is less expected
- Motion graphics and animated visual concepts

The video generation landscape is developing rapidly and will be meaningfully different by the time you are reading this.

Open Models and Local Capability

The open-source image generation ecosystem (primarily Stable Diffusion variants and the newer Flux family) continues to close the quality gap with proprietary models. For practitioners who prioritize privacy, cost, or customization, the local ecosystem is increasingly viable.

The most significant development is the availability of high-quality models that run on consumer hardware (12-16GB VRAM) at acceptable speeds. The cost of entry for high-quality local image generation has decreased significantly and continues to decrease.

Multimodal Integration

All major AI platforms are moving toward deeper multimodal integration. ChatGPT can discuss an image you upload, then modify it via DALL·E. Claude can analyze visual content. Gemini has extensive image capabilities. The distinction between "image generation tool" and "AI assistant with image capabilities" is blurring.

For practitioners, this means the choice of whether to use a dedicated image generation platform versus a multimodal assistant will increasingly depend on specific capability needs and workflow preference rather than fundamental availability.

Evaluating New Image Generation Tools

The principles for evaluating any new image generation tool that appears in the market:

Output quality: Generate the same 5-10 prompts you use as benchmarks across platforms. Quality comparisons are meaningful only when you run the same prompts, not when you compare cherry-picked examples from different vendors.

Prompt adherence: Does the tool generate what you actually described, or does it substitute its own aesthetic preferences for yours? Some tools (Midjourney historically) have strong aesthetic opinions that produce beautiful results but may not match your specific vision. Others (DALL·E 3) have higher prompt adherence.

Iteration speed and workflow: How long does generation take? How many refinement steps are typically needed? What does the iteration workflow look like? Slow, high-friction iteration workflows negate much of the productivity benefit.

Pricing model: Per-generation pricing creates friction around experimentation. Subscription pricing (unlimited or high-volume) enables the iterative approach that produces the best results. Evaluate total cost at your expected volume.
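A back-of-envelope comparison at your expected volume makes the trade-off concrete. All dollar figures below are hypothetical placeholders, not any vendor's actual pricing:

```python
# Sketch: total monthly cost under per-generation vs subscription
# pricing. All dollar figures are hypothetical placeholders.
def per_generation_cost(images: int, price_per_image: float) -> float:
    return images * price_per_image

def subscription_cost(images: int, flat_fee: float,
                      included: int, overage_price: float) -> float:
    overage = max(0, images - included)
    return flat_fee + overage * overage_price

volume = 400  # images per month, iterations included
print(per_generation_cost(volume, 0.04))            # e.g. $0.04 per image
print(subscription_cost(volume, 30.0, 1000, 0.02))  # e.g. $30 plan, 1000 included
```

Note that iterative prompting multiplies the image count well beyond the number of final deliverables, which is exactly why per-generation pricing discourages the experimentation that produces the best results.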

Rights and commercial use: The most important question for professional use. Verify current platform terms — they change.

Privacy: For commercial creative work, what happens to the images you generate? Are they stored, indexed, or potentially shown to other users? Professional use often requires clarity on this.
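The criteria above can be folded into a lightweight scoring sheet applied to each platform after running the same benchmark prompts. The weights here are illustrative choices, not a standard rubric; adjust them to your priorities:

```python
# Sketch: weighted scoring sheet for platform comparison using 1-5
# ratings per criterion. The weights are illustrative assumptions.
WEIGHTS = {
    "output_quality": 0.25,
    "prompt_adherence": 0.20,
    "iteration_speed": 0.20,
    "pricing_fit": 0.15,
    "rights_clarity": 0.10,
    "privacy": 0.10,
}

def weighted_score(ratings: dict) -> float:
    """Combine 1-5 ratings per criterion into one weighted score."""
    return sum(WEIGHTS[c] * ratings[c] for c in WEIGHTS)

# Hypothetical ratings for one platform, scored on the same benchmark
# prompts used for every other platform under evaluation.
platform_a = {"output_quality": 5, "prompt_adherence": 3, "iteration_speed": 4,
              "pricing_fit": 4, "rights_clarity": 4, "privacy": 5}
print(round(weighted_score(platform_a), 2))
```

The value of the exercise is less the final number than the discipline of rating every platform on identical prompts and identical criteria.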

Summary: A Framework for Choosing Your Platform

The platform choice decision simplifies to three questions:

Do you need maximum aesthetic quality and will you generate images regularly? → Learn Midjourney. The workflow investment pays off quickly.

Do you need convenience and occasional images without a new tool? → DALL·E 3 in ChatGPT. Natural language prompting and ChatGPT's conversation interface make it the most accessible option.

Do you need maximum control, volume, or privacy, and have technical aptitude? → Stable Diffusion, either cloud-hosted for easy start or local for volume and control.

For most business professionals who are not creatives by training, DALL·E 3 is the right starting point. For marketing and creative professionals who will use image generation as a core workflow tool, Midjourney rewards the learning investment.

The images you generate are tools for thinking and communication. The best platform is the one that fits into your actual workflow without requiring more overhead than the value it provides.


The next chapter steps back from the general-purpose platforms to examine the landscape of specialized and domain-specific AI tools — legal, medical, financial, scientific, and beyond.