Chapter 39 Exercises: Measuring Effectiveness

These exercises build your measurement practice from the ground up. The first section focuses on setup and baseline establishment; the second on analysis and improvement; the third on quality measurement; and the fourth on team and organizational measurement.


Section 1: Setting Up Your Measurement System

Exercise 1: Create Your Effectiveness Journal

Set up your AI effectiveness journal using a spreadsheet with the following columns:

  - Date
  - Task Type
  - Est. Without AI (min)
  - Actual With AI (min)
  - Time Saved (min)
  - Iteration Rounds
  - First Output (1-5)
  - Quality Rating (1-5)
  - Notes

Instructions:

  - Create this spreadsheet today
  - Complete it for at least five AI interactions before moving on to Exercise 2
  - In the Notes column, record anything notable: what worked particularly well, what failed, a prompt approach you want to try again

Reflection: After five entries, what do you notice? Does any pattern surprise you?
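If you'd rather keep the journal as a CSV file than a spreadsheet, a minimal logging helper might look like this. The column names mirror the ones above; the function name and derived time-saved calculation are illustrative, not prescribed:

```python
import csv
from pathlib import Path

# Columns matching the journal described above.
FIELDS = [
    "date", "task_type", "est_without_ai_min", "actual_with_ai_min",
    "time_saved_min", "iteration_rounds", "first_output_rating",
    "quality_rating", "notes",
]

def log_entry(path, **entry):
    """Append one journal entry, deriving time_saved_min automatically."""
    entry["time_saved_min"] = (
        entry["est_without_ai_min"] - entry["actual_with_ai_min"]
    )
    file = Path(path)
    is_new = not file.exists()
    with file.open("a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        if is_new:
            writer.writeheader()  # write the header row once
        writer.writerow(entry)
```

A single call per interaction keeps the habit lightweight, e.g. `log_entry("journal.csv", date="2024-05-01", task_type="client email", est_without_ai_min=30, actual_with_ai_min=10, iteration_rounds=2, first_output_rating=4, quality_rating=4, notes="")`.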


Exercise 2: Define Your Quality Dimensions

Quality looks different depending on your work. This exercise makes your quality standards explicit.

  1. List the three most common types of AI-assisted output you produce (e.g., "client emails," "data analysis reports," "code functions")

  2. For each output type, define 3-5 quality dimensions that matter:
     - What does "excellent" look like on this dimension?
     - What does "acceptable" look like?
     - What does "poor" look like?

  3. Create a simple 1-5 rating rubric for each dimension

Example for "research summary":

  - Accuracy: 5 = all claims verified; 3 = mostly verified with appropriate caveats; 1 = unverified claims presented as fact
  - Completeness: 5 = covers all relevant angles; 3 = covers main points; 1 = significant gaps
  - Clarity: 5 = clear to a non-expert; 3 = clear to someone in the field; 1 = confusing even to an expert

You'll use these rubrics in Exercise 3 and throughout your ongoing measurement practice.


Exercise 3: Establish Your Baseline

Before you can measure improvement, you need a baseline.

Baseline task: Choose a task type you regularly complete with AI assistance. Run through three instances of this task:

  1. Complete the task with AI assistance as you normally would
  2. Rate the output quality using your rubric from Exercise 2
  3. Record time taken and iteration rounds in your journal

After three instances, calculate:

  - Average time taken
  - Average quality rating
  - Average iteration rounds
  - First-output usability rate (how often was the first AI output useful without major revision?)
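The baseline arithmetic is simple enough to sketch directly. Here entries are represented as dicts; the field names are assumptions for illustration, not a required schema:

```python
def baseline(entries):
    """Summarize journal entries for one task type.

    Each entry is a dict with keys: time_min, quality (1-5),
    rounds, and first_output_usable (bool). Names are illustrative.
    """
    n = len(entries)
    return {
        "avg_time_min": sum(e["time_min"] for e in entries) / n,
        "avg_quality": sum(e["quality"] for e in entries) / n,
        "avg_rounds": sum(e["rounds"] for e in entries) / n,
        # True counts as 1, so this is the fraction of usable first outputs
        "first_output_usability": sum(e["first_output_usable"] for e in entries) / n,
    }
```

Running it on three recorded instances gives the four baseline numbers you'll compare against in six weeks.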

This is your baseline. You'll return to it in six weeks to measure improvement.


Exercise 4: Time Estimate Calibration

This exercise reveals systematic biases in how you estimate AI time savings.

For the next two weeks, before starting any AI-assisted task:

  1. Write down your estimate: how long would this take without AI?
  2. Complete the task with AI assistance and record the actual time

At the end of two weeks, compare your estimates to reality:

  - Are you consistently overestimating AI time savings? (You think it saves more than it does)
  - Are you consistently underestimating? (It saves more than you expect)
  - Does the bias vary by task type?

Reflection: What does your estimation pattern tell you about which tasks you're most vs. least accurate about?


Exercise 5: Calculate Your Current AI ROI

Using two weeks of effectiveness journal data:

  1. Sum your total time saved over the two weeks
  2. Multiply by your hourly time value (your annual compensation ÷ 2,000 working hours, or your billing rate)
  3. Annualize: multiply by 26 (two-week periods per year)
  4. Calculate annual AI subscription cost
  5. Compute ROI: annual time value ÷ annual cost
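The five steps above reduce to a few lines of arithmetic. A sketch, with illustrative function and parameter names:

```python
def annual_roi(time_saved_min_two_weeks, annual_comp, annual_ai_cost):
    """Annualized ROI of AI use, following the five steps above.

    Hourly time value is approximated as annual compensation
    divided by 2,000 working hours per year.
    """
    hourly_value = annual_comp / 2000
    hours_saved = time_saved_min_two_weeks / 60
    annual_time_value = hours_saved * hourly_value * 26  # 26 two-week periods
    return annual_time_value / annual_ai_cost
```

For example, 300 minutes saved in two weeks on $100,000 compensation against a $240/year subscription gives `annual_roi(300, 100_000, 240)`, roughly 27: each dollar of subscription returns about $27 of time value.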

Is your ROI positive? By how much? If it's lower than you expected, examine your data: which task categories are generating the lowest time savings?


Section 2: Analysis and Improvement

Exercise 6: Task Category Analysis

After four weeks of tracking, conduct your first task category analysis.

Group your journal entries by task category and calculate for each:

  - Average time saved per interaction
  - Average iteration rounds
  - Average quality rating
  - First-output usability rate

Create a simple 2x2 grid:

  - X-axis: Time savings (Low ← → High)
  - Y-axis: Quality rating (Low ← → High)

Place each task category in the grid. What's in your high-value quadrant? What's in your low-value quadrant?
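Quadrant placement can be sketched in a few lines. This is a minimal version with one arguable assumption baked in: "high" means above the mean across your categories, which you might replace with the median or a fixed threshold:

```python
def quadrant_map(category_stats):
    """Place each task category in the time-savings x quality 2x2 grid.

    category_stats maps category name -> (avg_time_saved, avg_quality).
    "High" means at or above the mean across categories (an assumption;
    a median or fixed cutoff works just as well).
    """
    times = [t for t, _ in category_stats.values()]
    quals = [q for _, q in category_stats.values()]
    t_cut = sum(times) / len(times)
    q_cut = sum(quals) / len(quals)
    grid = {}
    for name, (t, q) in category_stats.items():
        time_label = "high-savings" if t >= t_cut else "low-savings"
        qual_label = "high-quality" if q >= q_cut else "low-quality"
        grid[name] = f"{time_label}/{qual_label}"
    return grid
```

Categories landing in "high-savings/high-quality" are your high-value quadrant; "low-savings/low-quality" entries are candidates for the "Stop Doing" analysis in Exercise 9.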

Write a 200-word reflection: What action does this analysis suggest? What will you do more of? Less of? Differently?


Exercise 7: The Iteration Efficiency Retrospective

Identify three AI interactions from your journal that required 5 or more iteration rounds.

For each, write a brief retrospective:

  1. What was the task?
  2. What was the first prompt?
  3. Why did it not work well?
  4. What did subsequent prompts try to fix?
  5. In retrospect, what would an ideal first prompt have looked like?

Then write the ideal first prompt as if you were starting fresh. Save these improved prompts in your prompt library.

Reflection: Is there a pattern in what made these interactions difficult? Do they share common failure modes?


Exercise 8: The Batting Average Calculation

Using your effectiveness journal, calculate your AI batting average:

First-output scoring:

  - Used directly with minor edits: 1.0
  - Good foundation, moderate editing needed: 0.7
  - Useful but significant revision required: 0.4
  - Didn't save time or was misleading: 0.0

Calculate your batting average overall and by task category.
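The calculation itself is a weighted average over the scoring scale above. A sketch, where the short outcome keys are illustrative labels for the four scale levels:

```python
# Weights from the exercise's first-output scoring scale.
SCORES = {"direct": 1.0, "foundation": 0.7, "significant": 0.4, "failed": 0.0}

def batting_average(outcomes):
    """Average first-output score over a list of outcome keys."""
    return sum(SCORES[o] for o in outcomes) / len(outcomes)

def by_category(journal):
    """Batting average per category; journal maps category -> outcome keys."""
    return {cat: batting_average(outs) for cat, outs in journal.items()}
```

For instance, three interactions scored as used-directly, good-foundation, and failed average to (1.0 + 0.7 + 0.0) / 3 ≈ 0.57, squarely in the "consistent friction" band.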

Targets for reflection:

  - Above 0.7: What's working well here? Can you apply the same approach to lower-scoring categories?
  - 0.4-0.7: Where are the consistent friction points? What would improve this?
  - Below 0.4: Is this task worth AI-assisting at all? What would a different approach look like?


Exercise 9: The "Stop Doing" Analysis

This exercise is about productive elimination.

  1. List all the task categories where you currently use AI assistance
  2. For each, record from your journal: average time saved and average quality rating
  3. Identify any task categories with both below-average time savings AND below-average quality ratings
  4. For each low-value use case, decide:
     - Stop AI-assisting this task entirely
     - Try a fundamentally different approach (and specify what)
     - Continue but accept the limited value (and explain why)

  5. Implement your decisions for four weeks and re-measure

Most practitioners find at least one or two task categories that they've been AI-assisting out of habit rather than evidence. Eliminating or improving these makes the overall practice more efficient.


Exercise 10: The Prompt Improvement Experiment

Choose your lowest-batting-average task category. Design a structured experiment to improve your prompting approach:

  1. Define the task clearly
  2. Identify what you think is wrong with your current prompting approach
  3. Develop two alternative prompt structures to test
  4. Run each alternative on five instances of the task
  5. Compare batting averages and quality ratings

Report:

  - Did either alternative improve on your baseline?
  - What did you learn about why the original approach wasn't working?
  - What will you change in your ongoing approach?

This is the core of the improvement cycle: measurement → hypothesis → experiment → re-measurement.


Exercise 11: The Learning Curve Check

If you've been tracking for more than six weeks, this exercise checks your skill development trajectory.

  1. Calculate your batting average for your first two weeks of tracking
  2. Calculate it for your most recent two weeks
  3. Do the same for iteration rounds: average from the first two weeks vs. the most recent two weeks

Is your batting average higher? Are your iteration rounds lower?

If yes: great. Your practice is developing. What's driving the improvement?

If no: you've plateaued. What would you try to break out of the plateau? (Consider: new use cases, fundamentally different prompt structures, deeper domain context in prompts, better review workflows)


Section 3: Quality Measurement

Exercise 12: The Blind Comparison

This exercise requires a willing colleague.

  1. Produce two comparable outputs: one with full AI assistance and one with minimal or no AI assistance
  2. Ask your colleague to rate both outputs using your quality rubric (without knowing which is which)
  3. Compare the ratings

Reflection questions:

  - Was your colleague's quality assessment consistent with yours?
  - Was the AI-assisted version rated higher, lower, or equivalent?
  - Were there specific quality dimensions where AI assistance made a clear difference (positive or negative)?
  - What would you change about how you use AI for this task type based on the comparison?


Exercise 13: Build Your Error Catalog

Starting today, maintain a running error catalog: a record of mistakes AI has made in your specific use.

For each significant AI error, record:

  - Task type
  - Nature of the error (hallucination, logic error, outdated information, misunderstanding of context, etc.)
  - How you caught it
  - What it would have cost if undetected

After one month, analyze your error catalog:

  - Which error types are most common?
  - Which task categories generate the most errors?
  - Are you catching errors at the right point in your workflow?

Use this analysis to update your verification checklist for each task category.


Exercise 14: Client or Stakeholder Feedback Correlation

If your work receives external feedback (client ratings, approval/rejection, supervisor scores), track whether feedback correlates with AI assistance.

Over the next month, for each piece of work that receives external feedback:

  - Note whether it was AI-assisted (and to what degree)
  - Record the feedback

After 15-20 data points, look for correlation.

Interpretation:

  - No correlation: AI assistance isn't systematically affecting quality in your external audience's assessment
  - Positive correlation: AI-assisted work is receiving better feedback; AI is improving your output quality
  - Negative correlation: AI-assisted work is receiving worse feedback; AI assistance is reducing quality in ways your external audience detects

If you see a negative correlation, this is the most important finding of your measurement practice. Address it before it compounds.


Section 4: Team Measurement

Exercise 15: Design a Team Measurement System

If you're in a position to influence how your team measures AI effectiveness, design a lightweight team measurement system:

  1. Define 3-5 metrics the team will track collectively
  2. Design the simplest possible data collection mechanism (weekly self-report, shared spreadsheet, review-integrated tracking)
  3. Specify how often aggregate data will be reviewed and by whom
  4. Define what actions different metric trends will trigger

The system should add no more than 10 minutes per week of overhead per team member. If it takes more than that, it won't be sustained.


Exercise 16: The Team Effectiveness Benchmark

If your team has been using AI tools for at least two months, benchmark your team's AI effectiveness against these questions:

  1. What percentage of team members are using AI for at least one significant task weekly?
  2. What is the team's aggregate time savings estimate (from self-reports)?
  3. Has the team's error rate or revision cycle frequency changed since AI adoption?
  4. What is the quality of the team's shared prompt library? (Number of entries, how often consulted, how recently updated)
  5. How often do team members share AI learnings with each other?

Compare your answers to where the team was two months ago. What's improved? What hasn't?


Bonus Exercise: Build Your Personal AI Dashboard

If you want to take your measurement practice to the next level, build a simple personal AI dashboard using a spreadsheet.

Your dashboard should show at a glance:

  - Total hours saved this month and year-to-date
  - Current batting average, overall and for your top 3 task categories
  - Current average iteration rounds and trend (up/down/flat)
  - Current monthly ROI calculation
  - Top 3 highest-leverage use cases
  - Top 3 use cases to improve or stop

Update it monthly. Review it before you make any changes to your AI practice. Let the data guide your development decisions.