Chapter 18 Quiz: Image Generation — Midjourney, DALL·E, and Stable Diffusion
Test your understanding of AI image generation platforms, prompting strategies, and practical applications. Consider your answer before revealing it.
Question 1
Conceptually, how do diffusion models generate images?
A) They look up images in a large database and combine elements from matching results
B) They draw images pixel by pixel from top to bottom based on the prompt
C) They start from random visual noise and iteratively refine it toward an image that matches the prompt, guided by their learned understanding of text-image relationships
D) They use a rule-based system to place pre-defined visual elements according to prompt keywords
Answer
**C) They start from random visual noise and iteratively refine it toward an image that matches the prompt, guided by their learned understanding of text-image relationships** Diffusion models learn from millions of text-image pairs and from the process of "de-noising" images. At generation time, they start from random noise and apply learned de-noising steps guided by your text prompt, gradually resolving the noise into a coherent image. This explains the stochastic nature of generation: different random starting noise yields different images, which is why repeated runs of the same prompt produce noticeably different results.
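In rough pseudocode, the whole process is a seed, a noise tensor, and a loop of learned de-noising steps. The Python sketch below is a deliberately toy illustration of that loop, not a working generator; `denoise_step` stands in for the trained network and produces nothing image-like:

```python
# Toy sketch of the diffusion sampling loop only. A real model replaces
# `denoise_step` with a trained neural network conditioned on the prompt.
import numpy as np

rng = np.random.default_rng(seed=7)      # a different seed -> a different image
image = rng.normal(size=(64, 64, 3))     # step 0: pure random noise

def denoise_step(noisy: np.ndarray, prompt: str, step: int) -> np.ndarray:
    """Placeholder for the learned network: predicts and removes a little
    noise, nudging the pixels toward content that matches the prompt."""
    return noisy * 0.98                  # stand-in for the learned update

prompt = "a lighthouse at dusk"
for step in range(50):                   # samplers typically run tens of steps
    image = denoise_step(image, prompt, step)
# After the loop, a real model's `image` has resolved from noise into a
# coherent picture; a different starting seed resolves differently.
```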
Question 2
Which platform is most appropriate for a marketing professional who needs high-aesthetic-quality images regularly as a core part of their workflow?
A) DALL·E 3 through ChatGPT, because natural language is easiest
B) Midjourney, because it produces the highest aesthetic quality and rewards learning its workflow
C) Stable Diffusion, because it is free and open source
D) Any platform works equally well for professional marketing use
Answer
**B) Midjourney, because it produces the highest aesthetic quality and rewards learning its workflow** Midjourney consistently produces the highest aesthetic quality output among the major platforms, with sophisticated compositional sense and beautiful rendering that is immediately recognizable. For a professional whose workflow depends on image quality, the investment in learning Midjourney's parameters and workflow pays off rapidly. DALL·E 3 is better for convenience and occasional use. Stable Diffusion is better for maximum control and technical users. Each is the right choice for specific circumstances.
Question 3
What does the --chaos parameter control in Midjourney?
A) The level of artistic distortion or "weirdness" in the output
B) The variation between the four images generated in a single prompt — low values produce similar variations, high values produce very different interpretations
C) The number of steps in the generation process
D) The contrast and color intensity of the output
Answer
**B) The variation between the four images generated in a single prompt — low values produce similar variations, high values produce very different interpretations** `--chaos` controls how much variety appears across the four images generated from a single prompt. `--chaos 0` produces four images that are very similar to each other. `--chaos 100` produces four images that may look completely different from each other while all technically responding to the same prompt. Use high chaos during exploration to find unexpected creative directions; use low chaos when you've found a direction you want to refine and want to see similar variations. The `--weird` parameter is what controls unusual aesthetic qualities.
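To make the contrast concrete, here is a hypothetical pair of prompts (the lighthouse subject is invented for illustration; `--chaos` accepts values from 0 to 100):

```
/imagine prompt: a lighthouse at dusk, oil painting --chaos 0
/imagine prompt: a lighthouse at dusk, oil painting --chaos 100
```

The first returns four close variations on one interpretation; the second may return four images that differ in composition, palette, and even subject treatment.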
Question 4
Why are AI-generated hands and fingers a well-known failure mode?
A) AI models were trained without any images containing hands
B) Hands are legally restricted from AI training data
C) Hands appear in enormous variety across training data (different angles, partial views, motion blur), creating a complex distribution that generates statistically plausible but anatomically incorrect results
D) AI models lack the computational power to render hands correctly
Answer
**C) Hands appear in enormous variety across training data (different angles, partial views, motion blur), creating a complex distribution that generates statistically plausible but anatomically incorrect results** The conceptual model of diffusion generation helps explain this: the model generates something statistically consistent with training data. Hands appear in so many different configurations — different numbers of fingers visible, partially obscured, in motion, overlapping — that the "average" hand the model generates is often anatomically impossible. Areas with less visual variability in training data (simple objects, clear geometries, established compositional types) generate more reliably.
Question 5
A consultant needs to generate custom conceptual illustrations for a client presentation occasionally — perhaps once or twice a month. Which platform recommendation makes most sense?
A) Stable Diffusion locally installed for maximum control
B) Midjourney for highest quality
C) DALL·E 3 through ChatGPT for convenience, natural language prompting, and no new tool to learn
D) They should use stock photos instead
Answer
**C) DALL·E 3 through ChatGPT for convenience, natural language prompting, and no new tool to learn** For occasional use, the overhead of learning Midjourney's Discord-based workflow and parameter system is not justified. DALL·E 3 through ChatGPT allows natural language description, iterative refinement through conversation, and does not require a new tool — if the person already uses ChatGPT, image generation is available directly. The output quality, while not at Midjourney's top end, is more than adequate for business presentation illustrations.
Question 6
What is the primary advantage of DALL·E 3 over other major image generation platforms for text within images?
A) DALL·E 3 is the only model that can include any text in images
B) DALL·E 3 reliably generates readable, correctly spelled text within images — a capability other models handle poorly
C) DALL·E 3 uses a separate OCR system to add text after generation
D) DALL·E 3 can generate text in more languages than other models
Answer
**B) DALL·E 3 reliably generates readable, correctly spelled text within images — a capability other models handle poorly** Text rendering within generated images is a known weakness of diffusion models in general — they generate text as texture rather than discrete characters, producing words that look like text but are misspelled, blurred, or incoherent. DALL·E 3 (OpenAI's third-generation model) made significant improvements in text generation quality, producing readable and correctly spelled text much more reliably than Midjourney or Stable Diffusion. This makes DALL·E 3 the better choice when your image needs to include readable words.
Question 7
Which Stable Diffusion capability most distinguishes it from Midjourney and DALL·E 3 for users needing precise compositional control?
A) Higher resolution output
B) ControlNet, which allows using reference images to precisely constrain composition, pose, and spatial arrangement
C) Better aesthetic quality
D) Faster generation speed
Answer
**B) ControlNet, which allows using reference images to precisely constrain composition, pose, and spatial arrangement** ControlNet is the key differentiator for precision composition work. It allows you to provide a control signal (a pose skeleton, edge detection, depth map, or sketch) that the model uses to constrain the spatial arrangement of the generated image. You can extract a human pose from a reference photograph and generate a new image with exactly that pose, or provide a rough sketch that establishes the layout while AI generates the final image. This level of compositional control is not available in Midjourney or DALL·E 3.
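As a concrete sketch, the open-source `diffusers` library exposes ControlNet as a pipeline component. The snippet below is a minimal illustration, assuming the publicly hosted OpenPose ControlNet weights, a CUDA GPU, and a local pose-skeleton image (`pose_skeleton.png` is a placeholder filename):

```python
import torch
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline
from diffusers.utils import load_image

# Pose skeleton extracted from a reference photo (e.g. with an OpenPose detector)
pose_image = load_image("pose_skeleton.png")  # placeholder path

# Load the pose-conditioned ControlNet and attach it to a Stable Diffusion pipeline
controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-openpose", torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    controlnet=controlnet,
    torch_dtype=torch.float16,
).to("cuda")  # assumes a CUDA GPU is available

# The skeleton constrains body position; the text prompt controls everything else
result = pipe(
    "a chef plating a dish in a sunlit kitchen, professional photograph",
    image=pose_image,
    num_inference_steps=30,
).images[0]
result.save("posed_chef.png")
```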
Question 8
What does Midjourney's /describe command do?
A) It generates a text description of what is happening in a Midjourney-generated image
B) It takes an uploaded image and generates four prompt suggestions that might produce a similar image — useful for reverse-engineering visual styles
C) It describes the technical parameters used to generate the most recent image
D) It describes the platform's usage policies
Answer
**B) It takes an uploaded image and generates four prompt suggestions that might produce a similar image — useful for reverse-engineering visual styles** `/describe` is one of Midjourney's most useful learning tools. Upload any image (a photograph you like, a painting, a magazine page), and Midjourney returns four prompt suggestions that describe the image's visual qualities in Midjourney's vocabulary. This teaches you which terms Midjourney responds to for specific visual effects, and helps you understand how to describe styles you want to achieve. The generated prompts are starting points for iteration, not guaranteed to reproduce the original image exactly.
Question 9
What is the most important consideration when using AI-generated images in commercial work?
A) The image must be at least 300 DPI for print
B) Understanding and complying with the specific commercial rights granted by the platform's current terms of service
C) Ensuring the image was generated with at least 50 iteration steps
D) Crediting the AI model used in all commercial materials
Answer
**B) Understanding and complying with the specific commercial rights granted by the platform's current terms of service** Platform terms of service govern commercial rights, and they differ between platforms and change over time. Midjourney grants commercial rights to paid subscribers (with some restrictions). OpenAI's terms govern DALL·E usage. Stable Diffusion, as open-source software, has different considerations. The legal landscape for AI-generated image rights is also unsettled across jurisdictions. Before commercial use, read the current platform terms — not a summary from months ago — and consider legal review for high-stakes commercial applications.
Question 10
A practitioner generates an image showing a realistic-looking product endorsement by a well-known celebrity. What is the primary concern?
A) The image quality may not be good enough
B) Right of publicity and related legal concerns — using a real person's likeness without consent raises significant legal and ethical issues regardless of how the image was produced
C) The image may not be saved in the right file format
D) The celebrity's likeness is automatically trademarked
Answer
**B) Right of publicity and related legal concerns — using a real person's likeness without consent raises significant legal and ethical issues regardless of how the image was produced** Generating a photorealistic image of a real, identifiable person — particularly in a context that suggests endorsement or creates a false impression — raises serious legal concerns under right of publicity law, potential defamation law, and various advertising regulations. "The AI made it, not me" is not a legal defense. This is one of the clear "when not to" cases in AI image generation: generating images of real, specific people in contexts where they have not consented is legally and ethically problematic.
Question 11
What does the --sref [URL] parameter do in Midjourney?
A) Sets the server reference for the image to be stored
B) Uses a referenced image as a style guide, applying its visual aesthetic to the prompt while generating new content
C) Makes the image searchable by reference
D) Adds the referenced image directly into the composition
Answer
**B) Uses a referenced image as a style guide, applying its visual aesthetic to the prompt while generating new content** `--sref` (style reference) tells Midjourney to use the visual style of the referenced image as a guide for the new generation. The referenced image's aesthetic qualities — color palette, rendering style, atmosphere, compositional approach — influence the output without directly copying the image's content. This is useful for applying a consistent visual style across multiple generated images and for achieving specific aesthetic targets that are easier to show than describe. The related `--cref` parameter does the same for maintaining character/subject consistency.
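In prompt syntax this looks like the hypothetical example below; the URL is a placeholder, not a real style reference:

```
/imagine prompt: a street market at dawn, candid crowd scene --sref https://example.com/brand-style.jpg
```

The subject comes from the prompt text, while the palette, rendering style, and atmosphere are drawn from the referenced image.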
Question 12
For a marketing professional creating a campaign mood board from scratch, approximately how much time should a skilled Midjourney user expect to spend generating 8-10 high-quality images?
A) 5-10 minutes — modern AI tools are essentially instant
B) All day — quality image generation requires many hours of work
C) 60-120 minutes, including iteration to explore directions and refine toward the visual brief
D) 2-3 days — the same as traditional mood board creation
Answer
**C) 60-120 minutes, including iteration to explore directions and refine toward the visual brief** This is the realistic estimate based on actual practitioner experience. Initial prompt development and exploring directions takes time. Multiple rounds of iteration (generation → evaluation → refinement) are typically needed to get images that genuinely capture a brief rather than approximate it. Image selection and curation takes additional time. The 60-120 minute range for a quality mood board reflects the genuine value proposition: significant time savings over traditional approaches (which could take days) while still requiring meaningful creative and iterative work.
Question 13
When should a professional NOT use AI image generation for business communication?
A) When they need images of people in office settings
B) When the budget is over $500
C) When photographic accuracy of actual events, real people's actual appearances, or specific real spaces matters — or when passing AI-generated images as real photographs would be misleading
D) When using images for internal presentations rather than external materials
Answer
**C) When photographic accuracy of actual events, real people's actual appearances, or specific real spaces matters — or when passing AI-generated images as real photographs would be misleading** AI-generated images are not photographs of real events or people. Using them in contexts where audiences reasonably expect real photographs — journalism, documentation, "behind the scenes" content — is misleading. The same concern applies to depicting real, specific people or places that were not photographed. The appropriate uses are clearly communicative: concept illustration, mood boards, abstract visual representation of ideas, creative direction exploration.
Question 14
What is the primary reason that more specific photography terminology (lens types, aperture values, camera models) often produces better photorealistic results in image generation?
A) AI models contain actual photographs taken with those specific cameras
B) These technical terms appeared consistently in training data paired with images that had specific visual qualities, so the model learned to associate the terminology with those visual properties
C) The models can simulate lens optics mathematics
D) Photographers who submitted data to training sets tagged their images with this information
Answer
**B) These technical terms appeared consistently in training data paired with images that had specific visual qualities, so the model learned to associate the terminology with those visual properties** The relationship between photography terminology and visual output follows the same logic as all prompt engineering: the model learned from text-image pairs, and specific photography terms appeared consistently paired with specific visual qualities. "Shot on an 85mm portrait lens" appeared with images that have shallow depth of field and characteristic compression. "f/1.8 bokeh" appeared with images featuring background blur. The model learned these associations and can reproduce them. Using this vocabulary taps into those learned associations rather than relying on the model to interpret vague descriptions.
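As an invented illustration of the difference (neither prompt is taken from any platform's documentation):

```
Vague:    a professional-looking portrait of a woman
Specific: portrait of a woman, shot on an 85mm lens at f/1.8, shallow depth
          of field, soft window light, Kodak Portra 400 film grain
```

The second prompt works not because the model simulates lens optics, but because those terms co-occurred with a specific visual signature in its training data.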
Question 15
Which visual tells most reliably indicate that an image is AI-generated, given the current state of the technology?
A) Slightly blurry backgrounds
B) Overly bright colors
C) Anatomical anomalies (especially hands), background elements that become incoherent under examination, inconsistent lighting logic, and subtle skin texture issues
D) Perfect symmetry in faces
Answer
**C) Anatomical anomalies (especially hands), background elements that become incoherent under examination, inconsistent lighting logic, and subtle skin texture issues** These tells follow directly from how diffusion models generate: output that is statistically plausible can still be physically or anatomically wrong. Hands are the classic case (see Question 4), and the same logic produces background detail that dissolves under close inspection, lighting without a consistent source, and unnaturally smooth or waxy skin.