Case Study 1: Redesigning a Poorly Designed Market

Overview

In this case study, we examine five poorly designed prediction markets that generated controversy, confused traders, and produced unreliable forecasts. For each market, we systematically identify every design flaw, explain how it caused problems, and then redesign the market with proper resolution criteria. We conclude by simulating user feedback on the redesigned markets.


Part 1: The Five Flawed Markets

Market A: "Will AI Take Over?"

Original specification:

  - Title: "Will AI take over?"
  - Close date: December 31, 2030
  - Resolution: "Resolves YES if AI takes over."
  - Initial price: 0.50
  - Platform: Fictional internal corporate prediction market

Trading history: The market attracted 200 traders and $50,000 in volume. The price oscillated between 0.05 and 0.35 with no clear trend, reflecting widespread confusion about what "take over" meant. A heated comment thread with 300+ posts debated whether "take over" meant (a) AGI surpassing human intelligence, (b) AI replacing most jobs, (c) AI being used in most business processes, (d) an AI system making autonomous decisions with significant real-world impact, or (e) a sci-fi-style robot uprising.

Problems identified:

  1. Vague predicate ("take over"): This phrase has no standard definition and means radically different things to different people. It ranges from "AI is widely used" (already true by some definitions) to "AI enslaves humanity" (extremely unlikely).

  2. No measurable criterion: There is no metric, threshold, or observable event that would trigger resolution. The phrase "resolves YES if AI takes over" is circular.

  3. No resolution source: Who determines whether AI has "taken over"? The market creator? A vote? An expert panel? No source is specified.

  4. Inappropriate initial price: Starting at 0.50 for an event with wildly uncertain definition and very low base rate (under most interpretations) invited immediate selling, burning through the AMM subsidy.

  5. Overly long time horizon without interim checkpoints: A market lasting 6+ years without any interim resolution or checkpoint mechanism will suffer from extreme time-discounting effects and low engagement.

  6. No edge case handling: What if "AI" is redefined? What if a narrow AI system causes massive disruption but is not "general"? What if the question becomes moot because the company hosting the market goes bankrupt?

Redesign:

We split this into three specific, measurable markets:

Market A1: "Will an AI system score above 90th percentile on a comprehensive evaluation suite (SuperGLUE, MMLU, and at least one novel benchmark created after 2025) by December 31, 2028, as reported in a peer-reviewed publication or major AI lab technical report?"

Market A2: "Will AI-related automation reduce total US employment (BLS Current Employment Statistics) by more than 5% from the January 2025 level at any point before January 1, 2030?"

Market A3: "Will an AI system autonomously execute a consequential real-world action (financial transaction > $1M, infrastructure control decision, or military engagement) without direct human authorization, as documented in a credible journalistic report (NYT, WSJ, Reuters, AP) before January 1, 2029?"

Each redesigned market has:

  - A specific, measurable criterion
  - An identified resolution source
  - A clear time frame
  - An appropriate initial price (A1: 0.40, A2: 0.08, A3: 0.15)
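The shared shape of these redesigns can be sketched as a small data container with a self-check. This is an illustrative schema, not an actual platform's data model; the field names and `validate` checks are assumptions:

```python
from dataclasses import dataclass, field
from datetime import date

@dataclass
class MarketSpec:
    """Illustrative container for a well-specified binary market."""
    title: str                  # the precise question text
    criterion: str              # measurable resolution condition
    resolution_source: str      # named, publicly accessible source
    close_date: date            # hard time bound
    initial_price: float        # prior probability, not a default 0.50
    edge_cases: list = field(default_factory=list)

    def validate(self) -> list:
        """Return a list of spec problems (empty means the spec passes)."""
        problems = []
        if not 0.0 < self.initial_price < 1.0:
            problems.append("initial price must be strictly between 0 and 1")
        if not self.resolution_source:
            problems.append("no resolution source named")
        if not self.edge_cases:
            problems.append("no edge cases addressed")
        return problems

a1 = MarketSpec(
    title="Will an AI system score above the 90th percentile ...",
    criterion="benchmark suite score threshold",
    resolution_source="peer-reviewed publication or major lab report",
    close_date=date(2028, 12, 31),
    initial_price=0.40,
    edge_cases=["novel benchmark must postdate 2025"],
)
assert a1.validate() == []
```

A spec that omits a source or edge cases fails the self-check before it ever reaches traders.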


Market B: "Will There Be Peace in the Middle East?"

Original specification:

  - Title: "Will there be peace in the Middle East?"
  - Close date: December 31, 2026
  - Resolution: "Resolves YES if peace is achieved in the Middle East."
  - Initial price: 0.50

Problems identified:

  1. Undefined scope ("Middle East"): The Middle East encompasses multiple countries and conflicts. Which conflict? Israel-Palestine? Yemen civil war? Syrian civil war? Iran-Saudi tensions? All of them?

  2. Undefined predicate ("peace"): Does "peace" mean a signed treaty? Cessation of hostilities? Absence of armed conflict? Normalization of diplomatic relations? "Peace" is one of the most contested terms in international relations.

  3. Impossible to resolve: There is no single data source or event that would definitively establish "peace in the Middle East." Even experts would disagree on whether this criterion is met.

  4. Inappropriate time frame for the scope: If the question encompasses all Middle East conflicts, two years is unreasonably short. If it focuses on one conflict, the question needs to specify which one.

  5. No partial resolution mechanism: What if one conflict is resolved but another begins?

Redesign:

Market B1: "Will a formal ceasefire agreement between Israel and Hamas, acknowledged by both parties and verified by the UN Security Council, be in effect on December 31, 2026?"

  - Resolution source: UN Security Council records and official statements from both parties
  - Edge case: If Hamas ceases to exist as an organization, resolves N/A
  - Edge case: If a ceasefire is signed but violated within 7 days, it does not count as "in effect"

Market B2: "Will Israel and Saudi Arabia establish formal diplomatic relations (exchange of ambassadors or equivalent) before January 1, 2027?"

  - Resolution source: Official government announcements from both countries, confirmed by at least two Tier 3 media sources
  - Edge case: If relations are established then broken before the close date, resolves YES (they were established)

Market B3: "Will the total number of conflict-related fatalities in the Middle East and North Africa region, as reported by ACLED (Armed Conflict Location & Event Data), decrease by more than 50% in 2026 compared to 2025?"

  - Resolution source: ACLED annual data release
  - Backup: If ACLED data is unavailable, Uppsala Conflict Data Program (UCDP)
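Market B3's arithmetic is simple enough to encode directly. A minimal sketch, assuming the ACLED release yields a single annual fatality count per year; the function name and input shape are illustrative:

```python
def resolves_yes_fatality_decrease(fatalities_2025: int, fatalities_2026: int) -> bool:
    """Market B3: YES iff 2026 conflict fatalities fall by more than 50%
    versus 2025, per the annual data release (inputs are illustrative)."""
    if fatalities_2025 <= 0:
        raise ValueError("baseline year must have a positive count")
    # "more than 50%" is a strict inequality: exactly half resolves NO
    return fatalities_2026 < 0.5 * fatalities_2025

assert resolves_yes_fatality_decrease(40_000, 18_000) is True   # 55% decrease
assert resolves_yes_fatality_decrease(40_000, 20_000) is False  # exactly 50%
```

Encoding the rule forces a decision the prose left implicit: whether "more than 50%" is strict (it is, here).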


Market C: "Will Bitcoin Moon?"

Original specification:

  - Title: "Will Bitcoin moon?"
  - Close date: None specified
  - Resolution: "If Bitcoin moons"
  - Initial price: 0.50

Problems identified:

  1. Slang terminology ("moon"): "Moon" is crypto slang for a dramatic price increase, but there is no agreed-upon threshold. Some would say 2x is "mooning," others would require 10x or more.

  2. No time bound: Without a close date, the market could remain open indefinitely. Given Bitcoin's historical volatility and long-run upward drift, virtually any fixed threshold is likely to be hit eventually, making this a question about when, not if.

  3. No reference point: "Moon" relative to what starting price? The price at market creation? An all-time high? A specific date?

  4. No specific exchange or price feed: Bitcoin trades at slightly different prices on different exchanges. Which price matters?

  5. No specification of "Bitcoin": If Bitcoin forks, which chain counts?

Redesign:

Market C1 (Binary): "Will the 4:00 PM UTC daily closing price of BTC/USD on Coinbase exceed $200,000 at any point during the period January 1, 2026 through December 31, 2026?"

  - Resolution source: Coinbase historical price data (publicly available API)
  - Backup: CoinGecko aggregate price data
  - Edge case: If Coinbase discontinues BTC/USD trading, use Kraken BTC/USD instead
  - Edge case: If Bitcoin undergoes a hard fork, this refers to the chain retaining the "BTC" ticker on Coinbase

Market C2 (Bracket): "What will the BTC/USD price be on December 31, 2026 at 4:00 PM UTC on Coinbase?"

  - Brackets: [0, $50K), [$50K, $100K), [$100K, $150K), [$150K, $200K), [$200K, $300K), [$300K+)
  - Same resolution source and edge cases as C1
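Both Bitcoin markets now resolve mechanically from price data. A sketch of the resolution logic, assuming the daily closes arrive as a plain list of floats (function names are illustrative, not a platform API):

```python
from bisect import bisect_right

def resolve_c1(daily_closes: list[float], threshold: float = 200_000.0) -> bool:
    """Market C1: YES iff any 4:00 PM UTC daily close in the window
    exceeds the threshold."""
    return any(close > threshold for close in daily_closes)

# Bracket lower bounds are inclusive, upper bounds exclusive
BRACKET_EDGES = [50_000, 100_000, 150_000, 200_000, 300_000]
BRACKET_LABELS = ["[0, $50K)", "[$50K, $100K)", "[$100K, $150K)",
                  "[$150K, $200K)", "[$200K, $300K)", "[$300K+)"]

def resolve_c2(final_close: float) -> str:
    """Market C2: map the Dec 31, 2026 close to its bracket."""
    return BRACKET_LABELS[bisect_right(BRACKET_EDGES, final_close)]

assert resolve_c2(125_000) == "[$100K, $150K)"
```

Using `bisect_right` makes the bracket boundaries unambiguous: a close of exactly $50,000 falls in [$50K, $100K), matching the half-open interval notation.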


Market D: "Will the Next iPhone Be Good?"

Original specification:

  - Title: "Will the next iPhone be good?"
  - Close date: December 31, 2026
  - Resolution: Market creator decides based on reviews.
  - Initial price: 0.50

Problems identified:

  1. Subjective predicate ("good"): "Good" has no objective meaning. What is good to one person may be mediocre to another.

  2. Undefined subject ("next iPhone"): Which iPhone? The next numbered version? The next device released? What if Apple changes its naming scheme?

  3. Resolution by market creator: This creates a massive conflict of interest. The market creator can trade and then resolve in their favor.

  4. Vague resolution mechanism ("based on reviews"): Which reviews? How many? What threshold? Is a 4.0/5 average "good"? What about a mix of positive and negative reviews?

  5. Inappropriate initial price: iPhones are generally well-received, so a base rate of "good" (by most reasonable interpretations) is probably above 80%. Starting at 0.50 misrepresents the prior.

Redesign:

Market D1: "Will the first iPhone model released after September 1, 2026 (by any name) receive an average review score of 4.0 or higher out of 5.0 on the first 20 reviews published by outlets in the following list: [The Verge, TechCrunch, Wired, CNET, Ars Technica, Tom's Guide, PCMag, Engadget, WSJ, NYT], within 30 days of the product's release?"

  - Resolution source: Published reviews from the specified outlets, scores normalized to a 5-point scale
  - Resolution method: Automated aggregation by platform staff, not market creator
  - Edge case: If fewer than 10 of the specified outlets publish reviews within 30 days, the threshold is 4.0 out of 5.0 on however many reviews are available (minimum 5)
  - Edge case: If no iPhone is released after September 1, 2026 and before January 1, 2027, market resolves N/A
  - Initial price: 0.75 (reflecting historical base rate)
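The D1 aggregation rule, including the minimum-review edge case, can be sketched as follows. Score normalization is assumed to have happened upstream; the function name and return labels are illustrative:

```python
def resolve_d1(normalized_scores: list[float]) -> str:
    """Market D1: YES iff the average of the counted reviews (already
    normalized to a 5-point scale) is >= 4.0. Mirrors the edge cases
    in the spec above."""
    if len(normalized_scores) < 5:
        return "UNRESOLVED"            # below the 5-review minimum
    counted = normalized_scores[:20]   # at most the first 20 reviews count
    return "YES" if sum(counted) / len(counted) >= 4.0 else "NO"

assert resolve_d1([4.5, 4.0, 3.5, 4.2, 4.1]) == "YES"  # average 4.06
```

Writing the rule down exposes a remaining ambiguity the feedback table picks up on: "normalized to a 5-point scale" still needs an explicit conversion table per outlet.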


Market E: "Will It Rain Tomorrow?"

Original specification:

  - Title: "Will it rain tomorrow?"
  - Close date: Tomorrow
  - Resolution: "If it rains"
  - Initial price: 0.50

Problems identified:

  1. No location specified: Rain where? A market without a location is meaningless for a weather event.

  2. Perpetually rolling "tomorrow": If this is a standing market, "tomorrow" changes every day. Even for a one-day market, "tomorrow" is relative to the viewer's time zone.

  3. No definition of "rain": Any precipitation? Measurable precipitation (>0.01 inches)? Rain specifically (not snow, sleet, or hail)?

  4. No time period within the day: Rain at any point during the day? For a minimum duration? During business hours?

  5. No measurement source: Weather stations? Airport records? A specific weather service?

Redesign:

Market E1: "Will the National Weather Service (NWS) weather station at Chicago O'Hare International Airport (KORD) record measurable precipitation (>= 0.01 inches of liquid equivalent) during the 24-hour period from 12:00 AM to 11:59 PM Central Time on February 18, 2026?"

  - Resolution source: NWS preliminary climatological data for KORD
  - Backup: Weather Underground historical data for KORD
  - Edge case: If the KORD station is inoperative, use Chicago Midway (KMDW) instead
  - Initial price: Based on historical climatological average for the date and current NWS forecast probability of precipitation


Part 2: User Feedback Simulation

To test our redesigns, we simulate feedback from five user archetypes:

User Archetypes

  1. Expert Trader (Alex): Experienced prediction market user, focuses on accuracy and edge cases
  2. Casual Participant (Jordan): New to prediction markets, wants simplicity
  3. Platform Operator (Sam): Concerned with scalability and dispute rates
  4. Domain Expert (Riley): Deep knowledge of the specific subject matter
  5. Adversarial Trader (Quinn): Looks for exploits and semantic loopholes

Feedback on Market A1 Redesign (AI Benchmarks)

| User | Feedback | Severity |
|------|----------|----------|
| Alex | "Well-specified. I'd want clarity on what counts as a 'novel benchmark created after 2025' --- who determines novelty?" | Medium |
| Jordan | "I understand this one. AI test scores, got it." | Low |
| Sam | "Good for automation. Benchmark scores are machine-readable." | Low |
| Riley | "SuperGLUE is already saturated. Consider specifying unsaturated benchmarks or using a meta-benchmark." | High |
| Quinn | "Could an AI lab create a trivially easy 'novel benchmark' to satisfy the criterion? The word 'comprehensive' is doing a lot of work." | High |

Response to feedback: Riley and Quinn raise valid concerns. Revised criterion: "...a comprehensive evaluation suite including at least three benchmarks from the following approved list: [MMLU, GPQA, SWE-bench, ARC-AGI, MATH-500, FrontierMath], scoring above 90th percentile of human expert performance on each..."
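The revised A1 criterion is now mechanically checkable, closing Quinn's trivial-benchmark loophole. A sketch, assuming results arrive as benchmark-name-to-percentile pairs (the input shape is an assumption):

```python
# Approved list from the revised criterion above
APPROVED = {"MMLU", "GPQA", "SWE-bench", "ARC-AGI", "MATH-500", "FrontierMath"}

def meets_revised_a1(results: dict[str, float], percentile_floor: float = 90.0) -> bool:
    """Revised Market A1: at least three benchmarks from the approved
    list, each scored above the 90th percentile of human expert
    performance. `results` maps benchmark name -> percentile."""
    qualifying = [name for name, pct in results.items()
                  if name in APPROVED and pct > percentile_floor]
    return len(qualifying) >= 3

assert meets_revised_a1({"MMLU": 95.0, "GPQA": 91.2, "SWE-bench": 93.4}) is True
# An unapproved "novel benchmark" no longer helps satisfy the criterion:
assert meets_revised_a1({"MMLU": 95.0, "EasyBench-2026": 99.9}) is False
```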

Feedback on Market B1 Redesign (Israel-Hamas Ceasefire)

| User | Feedback | Severity |
|------|----------|----------|
| Alex | "Clean resolution criteria. Good edge case handling." | Low |
| Jordan | "What does 'in effect' mean if there are minor violations?" | Medium |
| Sam | "UN Security Council records are reliable but slow to publish." | Medium |
| Riley | "Hamas is a decentralized organization. Who speaks for 'Hamas' in acknowledging a ceasefire?" | High |
| Quinn | "If a ceasefire is 'in effect' on Dec 31 but was signed Dec 30, does that count even if it collapses on Jan 1?" | Medium |

Response to feedback: Riley's point is critical. Revised: "...acknowledged by Hamas's Political Bureau or recognized senior leadership..." Quinn's scenario is addressed by the existing specification (must be "in effect on December 31, 2026" --- a ceasefire signed Dec 30 that is still holding on Dec 31 counts). Add clarification: "The ceasefire must have been in effect for at least 48 continuous hours as of December 31, 2026, 23:59 UTC."

Feedback Summary Table

| Original Market | Redesign Quality Score (1-10) | Main Remaining Issue | Resolution |
|-----------------|-------------------------------|----------------------|------------|
| A: "AI Take Over" | 8/10 | Benchmark novelty definition | Use approved benchmark list |
| B: "Middle East Peace" | 7/10 | Hamas organization definition | Specify recognized leadership |
| C: "Bitcoin Moon" | 9/10 | Exchange discontinuation unlikely but addressed | No change needed |
| D: "iPhone Good" | 8/10 | Review score normalization | Provide explicit conversion table |
| E: "Rain Tomorrow" | 9/10 | Very clean; NWS data is reliable | No change needed |

Part 3: Design Principles Extracted

From these five redesigns, we extract general principles:

Principle 1: Decompose Vague Questions into Specific Sub-Questions

The vague "Will AI take over?" became three specific, measurable markets. This decomposition preserves the spirit of the original question while making each component resolvable.

Principle 2: Every Key Term Needs a Definition

"Take over," "peace," "moon," "good," "rain" --- every predicate and subject must be pinned to a specific, measurable definition. If a term appears in the question, it must appear (defined) in the resolution criteria.

Principle 3: Resolution Sources Must Be Named in Advance

"Market creator decides" is never acceptable for serious markets. The resolution source must be a specific, named, publicly accessible data source identified before trading begins.

Principle 4: Edge Cases Must Be Addressed Proactively

The adversarial trader archetype (Quinn) finds loopholes that the designer missed. Every market should be reviewed by someone actively trying to exploit it before publication.

Principle 5: Initial Prices Should Reflect Base Rates

Starting every market at 0.50 wastes subsidy and sends a misleading signal. Use available base rates, expert estimates, or reference class forecasting to set informative initial prices.
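One simple way to turn a reference class into an initial price is Laplace smoothing, which keeps small samples from producing overconfident priors. A sketch of the principle, with hypothetical counts:

```python
def initial_price_from_base_rate(hits: int, trials: int) -> float:
    """Set an initial price from a reference class using Laplace
    smoothing, (hits + 1) / (trials + 2), so small samples are not
    taken at face value. A sketch of the principle, not a platform rule."""
    if trials < 0 or hits < 0 or hits > trials:
        raise ValueError("need 0 <= hits <= trials")
    return (hits + 1) / (trials + 2)

# Hypothetical reference class: 14 of the last 16 comparable product
# launches were "well-received" by the chosen metric
price = initial_price_from_base_rate(14, 16)
assert 0.80 < price < 0.90   # the market opens near 0.83, not 0.50
```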

Principle 6: Simulated User Feedback Improves Design

Even simulated feedback from different archetypes (expert, casual, operator, domain expert, adversarial) reveals issues that a single designer would miss. A structured review process is essential.


Part 4: Code Implementation

See code/case-study-code.py for Python implementations of:

  - A market quality evaluator that scores a market specification against the SMART criteria
  - A simulated user feedback generator
  - A market redesign suggestion engine
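As a flavor of what such an evaluator might look like, here is a minimal sketch of SMART-style checks. The dict keys, the vague-term list, and the check labels are illustrative assumptions, not the actual code in case/case-study-code.py:

```python
import re

# Vague predicates that commonly appear in poorly specified questions
VAGUE_TERMS = {"take over", "peace", "moon", "good", "soon", "tomorrow"}

def smart_score(spec: dict) -> tuple[int, list[str]]:
    """Score a market spec dict against five SMART-style checks (0-5).
    Keys used: title, resolution, resolution_source, initial_price,
    close_date (all illustrative)."""
    issues = []
    text = spec.get("resolution", "").lower()
    if any(term in spec.get("title", "").lower() for term in VAGUE_TERMS):
        issues.append("Specific: title contains a vague predicate")
    if not re.search(r"\d", text):
        issues.append("Measurable: no number or threshold in resolution text")
    if not spec.get("resolution_source"):
        issues.append("Sourced: no resolution source named")
    if spec.get("initial_price") == 0.50:
        issues.append("Realistic: default 0.50 price; use a base rate")
    if not spec.get("close_date"):
        issues.append("Time-bound: no close date")
    return 5 - len(issues), issues

# The original Market C fails every check
score, issues = smart_score({"title": "Will Bitcoin moon?",
                             "resolution": "If Bitcoin moons",
                             "initial_price": 0.50})
```

Even a crude checker like this would have flagged all five of the original markets before they went live.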


Discussion Questions

  1. Is there a point where too much precision in resolution criteria actually harms market participation? Where is the optimal balance?

  2. For Market B (Middle East peace), is it better to create one comprehensive market or multiple narrow markets? What are the trade-offs for information aggregation?

  3. How should platforms handle markets that were well-designed at creation but become problematic due to unforeseen real-world changes (e.g., a company rebranding, a data source being discontinued)?

  4. The adversarial trader archetype (Quinn) found potential exploits in several redesigns. Should platforms incentivize adversarial review of new markets? How?

  5. Consider the ethical dimension: should prediction markets exist for topics like "Will there be peace?" even with proper resolution criteria? Does monetizing predictions about conflict create perverse incentives?