Case Study 2: Zillow's Zestimate — When Regression Models Meet the Real Estate Market


Introduction

In 2006, Zillow launched the Zestimate — an automated home valuation tool that used regression and machine learning to estimate the market value of over 100 million homes in the United States. It was, by any measure, an ambitious application of the techniques covered in this chapter: multiple regression, feature engineering, ensemble methods, and continuous model refinement at massive scale.

For fifteen years, the Zestimate was a remarkable success. It became the most widely recognized automated valuation model (AVM) in the world. Homeowners checked their Zestimates obsessively. Real estate agents used them (sometimes grudgingly) as conversation starters. The tool drove billions of pageviews to Zillow's platform, establishing the company as the dominant brand in online real estate.

Then, in 2021, Zillow made a fateful decision: it bet billions of dollars on its own model's predictions. The company launched Zillow Offers, an iBuying program that used Zestimate-derived models to buy homes directly from sellers at algorithmically determined prices.

Within eighteen months, Zillow Offers had lost approximately $881 million. Zillow laid off 2,000 employees — 25 percent of its workforce. The division was shut down entirely.

This case study examines both sides of the Zestimate story: the remarkable engineering achievement of building a regression model that reasonably estimates home values at continental scale, and the catastrophic consequences of overconfidence in that model's predictions when real money was at stake.


Phase 1: Building the Zestimate (2006-2015)

The Regression Problem

Real estate valuation is a natural regression problem. The target variable is the home's market value (a continuous number). The input features include property characteristics (square footage, bedrooms, bathrooms, lot size, year built), location factors (neighborhood, school district, proximity to amenities), and market conditions (recent comparable sales, local price trends, interest rates).

Zillow's initial Zestimate was built on a multiple regression framework:

estimated_value = f(property_features, location_features, market_features)
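The framework above can be sketched in a few lines with scikit-learn. This is a minimal illustration on synthetic data; the feature names, coefficients, and values are invented for the example and are not Zillow's actual inputs:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 1000

# Illustrative features: square footage, bedrooms, and a neighborhood
# price index standing in for location quality.
sqft = rng.uniform(800, 4000, n)
beds = rng.integers(1, 6, n)
nbhd_index = rng.uniform(100, 400, n)  # rough $/sqft proxy for location

# Synthetic "true" values; the noise term stands in for everything
# the features do not capture (condition, views, timing, luck).
value = sqft * nbhd_index * 0.9 + beds * 10_000 + rng.normal(0, 25_000, n)

X = np.column_stack([sqft, beds, nbhd_index])
model = LinearRegression().fit(X, value)

# Estimate one hypothetical home: 2,000 sqft, 3 beds, mid-priced area.
est = model.predict([[2000, 3, 250]])[0]
print(f"estimated value: ${est:,.0f}")
```

Even this toy version shows the framework's limits: the synthetic data contains a square-footage-by-location interaction that a plain linear model cannot fully capture, which is exactly the kind of structure discussed below.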

The core challenge was not the algorithm — it was the data. Zillow needed to assemble a nationwide database of:

- Property characteristics from county tax assessor records (public but fragmented, inconsistent, and often outdated)
- Transaction prices from recorded sales (public, but with a time lag and coverage gaps)
- Listing data from MLS (Multiple Listing Service) databases (partially accessible, often gated)
- Geographic data for neighborhood-level features (school quality, crime rates, walkability, commute times)

The Feature Engineering Challenge

Home valuation required extensive feature engineering — many of the techniques described in Section 8.8:

Location encoding. Latitude and longitude alone are insufficient. The value of a home at coordinates (40.7128, -74.0060) depends on the neighborhood at those coordinates — and neighborhood quality can change dramatically over a few blocks. Zillow developed spatial features that encoded local context: average nearby home values, price trends within a radius, distance to transit, parks, and commercial districts.

Comparable sales (comps). The most powerful feature for predicting a home's value is the recent sale price of similar nearby homes. Zillow engineered "comp features" — statistical summaries of recent sales within a defined radius that matched on key attributes (bedrooms, square footage, age). This is conceptually similar to target encoding (Section 8.8).

Temporal features. Home values are time-dependent. A home's value in January 2024 depends on market conditions in January 2024, not January 2020. Zillow incorporated market trend features — rolling averages of price changes at the ZIP code, county, and metro level — to capture momentum and seasonality.

Interaction effects. Square footage matters, but its impact depends on location. An extra 500 square feet in Manhattan adds far more value than the same square footage in rural Kansas. Zillow's models included location-by-feature interactions to capture this spatial heterogeneity.
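The three feature families above — comps, trends, and interactions — can be sketched with pandas. This is an illustrative simplification on synthetic data: real comp features match on radius and attributes, not just ZIP code, and the column names here are invented:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
sales = pd.DataFrame({
    "zip": rng.choice(["98101", "98102", "98103"], 500),
    "sale_month": rng.integers(0, 24, 500),
    "sqft": rng.uniform(800, 3500, 500),
    "price": rng.uniform(300_000, 900_000, 500),
})

# Comp feature: median sale price in the same ZIP — a crude stand-in
# for Zillow's radius-and-attribute-matched comparable sales.
sales["zip_median_price"] = sales.groupby("zip")["price"].transform("median")

# Temporal feature: a rolling mean of prices within each ZIP, ordered by
# sale month, to capture local momentum.
sales["zip_trend"] = (
    sales.sort_values("sale_month")
         .groupby("zip")["price"]
         .transform(lambda s: s.rolling(30, min_periods=1).mean())
)

# Interaction feature: square footage scaled by the local price level,
# so an extra square foot is worth more in expensive areas.
sales["sqft_x_zip_price"] = sales["sqft"] * sales["zip_median_price"]
```

Each engineered column then becomes one more input to the regression model, alongside the raw property characteristics.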

The Algorithm Evolution

Zillow's modeling approach evolved through the same progression covered in this chapter:

  1. Multiple linear regression (2006-2009): The initial Zestimate used standard regression with property and location features. Median error was approximately 14 percent — meaning the Zestimate was within 14 percent of the eventual sale price for the median home.

  2. Regularized regression with engineered features (2009-2012): Ridge and Lasso regression with more sophisticated features (comp-based features, spatial features, trend features) reduced median error to approximately 8-10 percent.

  3. Ensemble methods (2012-2016): Random Forest and gradient boosting models captured nonlinear relationships and interaction effects that linear models missed. Median error dropped to approximately 6-8 percent.

  4. Neural networks and deep learning (2016-present): Zillow incorporated neural network models, including convolutional networks for processing property photographs and recurrent networks for temporal price patterns. By 2019, the national median error was approximately 5 percent for on-market homes and 7.5 percent for off-market homes.

Business Insight. The distinction between on-market and off-market accuracy is critical. When a home is listed for sale, Zillow has access to the listing price, agent description, and updated photographs — all of which dramatically improve prediction accuracy. Off-market homes (not currently listed) rely on older data and are inherently harder to value. This distinction would prove consequential in the iBuying disaster.

The Zillow Prize Competition

In 2017, Zillow launched the Zillow Prize — a $1.2 million Kaggle competition to improve the Zestimate. Over 3,800 teams competed. The winning approaches used stacked ensembles combining gradient boosting, neural networks, and specialized spatial models. The competition improved Zillow's median error by approximately 13 percent relative to the existing Zestimate.

Key winning strategies:

- Feature engineering dominated algorithm selection. Top teams spent far more time on features (neighborhood-level aggregations, temporal patterns, spatial interpolation) than on model architecture.
- Ensemble stacking was universal. No single algorithm won. Every top team combined multiple models — typically XGBoost, LightGBM, and neural networks — using stacking or blending.
- Outlier handling was critical. The competition evaluated accuracy on the log of the predicted/actual ratio. Teams that carefully handled outliers (foreclosure sales, luxury properties, data errors) gained disproportionate accuracy.
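The stacking pattern the winning teams used can be sketched with scikit-learn's built-in ensembles. This is a minimal sketch on synthetic data, substituting scikit-learn's gradient boosting and random forest for the XGBoost/LightGBM/neural-network combinations the actual teams used:

```python
import numpy as np
from sklearn.ensemble import (GradientBoostingRegressor, RandomForestRegressor,
                              StackingRegressor)
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(2)
X = rng.uniform(0, 1, (600, 5))
# Synthetic target with an interaction term and noise.
y = 300_000 + 200_000 * X[:, 0] * X[:, 1] + 50_000 * X[:, 2] \
    + rng.normal(0, 20_000, 600)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

stack = StackingRegressor(
    estimators=[
        ("gbm", GradientBoostingRegressor(random_state=0)),
        ("rf", RandomForestRegressor(n_estimators=100, random_state=0)),
    ],
    # The final estimator blends the base models' out-of-fold predictions.
    final_estimator=RidgeCV(),
)
stack.fit(X_tr, y_tr)
print("stacked R^2 on held-out data:", round(stack.score(X_te, y_te), 3))
```

The key design idea is that the blender is trained on out-of-fold predictions, so it learns how much to trust each base model without overfitting to their training-set performance.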


Phase 2: The Zestimate as Brand Asset (2015-2019)

By 2019, the Zestimate had become one of the most recognized AI products in the consumer world. Approximately 200 million homes had Zestimates. Over 100 million unique users visited Zillow monthly, many primarily to check their home's estimated value.

The Zestimate's value to Zillow was primarily indirect. It drove traffic, which drove advertising revenue (Zillow's core business model was selling leads to real estate agents), which drove profitability. The model didn't need to be perfect — it needed to be good enough to be useful and engaging.

This is a critical distinction. A model with a 5-7 percent median error is highly useful for:

- Consumers wanting a rough sense of their home's value
- Buyers screening neighborhoods and properties
- Agents having informed initial conversations with clients
- Investors conducting preliminary market analysis

A 5-7 percent error means the Zestimate for a $500,000 home was typically within $25,000-$35,000 of the actual sale price. Not precise enough for a legal appraisal, but enormously useful for orientation and comparison.

Definition. The median absolute percentage error of the Zestimate measures accuracy by taking the median (not mean) of the absolute percentage difference between the Zestimate and the actual sale price, across all homes sold. The median is used instead of the mean because home valuation errors are right-skewed — a few extremely unusual sales (distressed properties, luxury estates) would inflate the mean disproportionately.
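The definition above can be made concrete with a small numerical example. The numbers here are invented to illustrate why the median is preferred; note how a single unusual sale inflates the mean but leaves the median almost untouched:

```python
import numpy as np

# Five hypothetical homes; the last is an unusual sale (e.g., a luxury
# estate the model badly misjudged).
zestimates  = np.array([480_000, 510_000, 305_000, 720_000, 1_900_000])
sale_prices = np.array([500_000, 495_000, 310_000, 700_000,   950_000])

ape = np.abs(zestimates - sale_prices) / sale_prices

print(f"median APE: {np.median(ape):.1%}")  # robust to the outlier
print(f"mean APE:   {np.mean(ape):.1%}")    # dominated by the outlier
```

Four of the five estimates are within a few percent of the sale price, so the median stays low; the mean is pulled up by the single large miss — exactly the right-skew the definition describes.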


Phase 3: The iBuying Bet (2018-2021)

In 2018, Zillow launched Zillow Offers, entering the "iBuying" market. The concept was simple: use the Zestimate (with enhancements) to make instant cash offers to homeowners, buy the home, make minor repairs, and resell it within 90 days.

The business logic seemed compelling:

- Zillow already had the best home valuation model in the industry
- The model's accuracy had been improving steadily for over a decade
- The iBuying market was growing, with Opendoor proving the concept
- Buying and selling homes was a far larger revenue opportunity than advertising

There was just one problem: the Zestimate was designed to be directionally accurate for consumer engagement, not transactionally accurate for billion-dollar purchasing decisions.

Where the Model Failed

1. The precision gap.

The Zestimate's 5-7 percent median error sounds good in abstract terms. But in iBuying, where Zillow's margin on each transaction was approximately 2-5 percent, even a 3 percent pricing error could turn a profit into a loss. The model's accuracy was sufficient for its original purpose (engagement) but insufficient for its new purpose (transactions).

Consider: if Zillow buys a home for $500,000 based on a model estimate, and the actual market value turns out to be $475,000 (a 5 percent error), Zillow has lost $25,000 — plus renovation costs, holding costs, and transaction costs. Multiply this by thousands of homes, and the losses compound rapidly.

2. The adverse selection problem.

Homeowners who accepted Zillow's instant offers were disproportionately those who knew their homes were worth less than the Zestimate suggested. A homeowner whose home was worth more than the Zestimate would list on the open market. A homeowner whose home was worth less would happily accept Zillow's inflated offer.

This is a classic adverse selection problem — the same dynamic that plagues insurance markets. Zillow's model was accurate on average, but the homes Zillow actually bought were systematically skewed toward overvaluation.
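The adverse selection mechanism is easy to demonstrate by simulation. In this sketch (with invented numbers), the model is perfectly unbiased on average, yet the subset of offers that sellers accept still systematically overpays:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 100_000

true_value = rng.normal(500_000, 100_000, n)
# An unbiased model: each offer is the true value plus symmetric noise
# (standard deviation ~5% of the average home price).
offer = true_value + rng.normal(0, 0.05 * true_value.mean(), n)

# Sellers accept only when the offer exceeds what they know the home is worth.
accepted = offer > true_value

overpay = offer - true_value
print(f"avg error across all offers:   ${overpay.mean():,.0f}")
print(f"avg overpay on accepted deals: ${overpay[accepted].mean():,.0f}")
```

Averaged over every offer, the error is essentially zero; averaged over accepted offers only, it is large and positive. No improvement in average accuracy removes this gap — it shrinks only as the noise itself shrinks, which is why adverse selection is so punishing for a model with a 5-7 percent error.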

3. The market timing risk.

In late 2021, the U.S. housing market began shifting. After two years of unprecedented price increases driven by low interest rates, remote work migration, and supply constraints, prices began to plateau and, in some markets, decline. Zillow's models, trained primarily on recent data showing aggressive price appreciation, continued to predict rising prices.

This is the extrapolation problem described in Section 8.12 — the model had been trained during a period of relentless price increases and had no experience with market corrections. When conditions changed, the model's predictions diverged from reality.
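The extrapolation failure can be shown with a deliberately simple sketch: a trend model fit only on a synthetic bull market keeps projecting appreciation after the (synthetic) market turns. The price index and turning point below are invented for illustration:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic monthly price index: 24 months of steady appreciation,
# then a correction the model never saw during training.
months = np.arange(36).reshape(-1, 1)
prices = np.where(months.ravel() < 24,
                  300 + 5 * months.ravel(),          # bull market: +5/month
                  420 - 3 * (months.ravel() - 24))   # correction: -3/month

# Train only on the bull run, then forecast into the correction.
trend = LinearRegression().fit(months[:24], prices[:24])
forecast = trend.predict(months[24:])

print("actual index at month 35:  ", prices[-1])
print("forecast index at month 35:", round(forecast[-1]))
```

The fitted trend is a perfect description of the training period and a badly wrong description of the forecast period — the model has no mechanism for recognizing that the regime has changed.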

4. The operational complexity.

Buying, renovating, and reselling homes is a fundamentally different business from running a website. Each home is unique. Renovation costs are uncertain. Local market dynamics vary block by block. The operational execution risk was enormous — and the regression model, no matter how accurate, could not account for a leaking roof discovered during renovation, a zoning change that reduced property values, or a buyer's market that extended selling timelines.

The Collapse

By October 2021, Zillow had accumulated approximately 7,000 homes — far more than it could sell in a softening market. Internal analysis revealed that the iBuying division was buying homes above market value at an alarming rate. The company announced it would stop buying homes, write down $569 million in inventory losses, and lay off 2,000 employees.

The total loss on the Zillow Offers experiment was approximately $881 million.

CEO Rich Barton acknowledged: "We've determined the unpredictability in forecasting home prices far exceeds what we anticipated."

Caution. Barton's statement contains an important lesson about regression models in business: uncertainty is not just a statistical concept — it is a financial risk. A model with a 5 percent median absolute percentage error does not just make 5 percent errors on average. It occasionally makes 15 or 20 percent errors. When those errors are directional (consistently over-predicting) and the stakes are high (millions of dollars per transaction), the tail risk can be catastrophic.


Phase 4: Lessons Learned (2022-Present)

After the iBuying shutdown, Zillow refocused on its core advertising business and continued improving the Zestimate for its original consumer engagement purpose. The model remains the industry benchmark for automated home valuation.

The Evolving Zestimate

Post-iBuying, Zillow has made several improvements to the Zestimate:

- Neural Zestimate: A deep learning model that processes property photographs, floor plans, and satellite imagery alongside structured data. Images capture information that structured features miss — the quality of finishes, curb appeal, views, natural light.
- Market condition adjustment: Enhanced sensitivity to real-time market shifts, including interest rate changes, local inventory levels, and days-on-market trends.
- Uncertainty quantification: The Zestimate now displays a confidence range rather than a single point estimate — e.g., "$485,000 to $535,000" rather than just "$510,000." This is a direct response to the iBuying lesson that point estimates without uncertainty ranges are dangerously misleading.
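One standard way to produce a confidence range rather than a point estimate is quantile regression: fit one model for a low quantile and one for a high quantile. Zillow has not published the exact method behind its displayed ranges, so the sketch below uses scikit-learn's quantile-loss gradient boosting on synthetic data purely as an illustration of the technique:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(4)
X = rng.uniform(800, 4000, (800, 1))               # square footage
y = 150 * X.ravel() + rng.normal(0, 40_000, 800)   # synthetic home values

# One model per quantile yields a range instead of a single number.
low  = GradientBoostingRegressor(loss="quantile", alpha=0.05,
                                 random_state=0).fit(X, y)
high = GradientBoostingRegressor(loss="quantile", alpha=0.95,
                                 random_state=0).fit(X, y)

home = [[2000]]
print(f"90% range: ${low.predict(home)[0]:,.0f}"
      f" to ${high.predict(home)[0]:,.0f}")
```

The width of the range is itself informative: a wide interval signals a home the model finds hard to value, which is precisely the warning a point estimate suppresses.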


Regression Lessons for Business Leaders

1. Accuracy Is Relative to Purpose

A 5 percent error is excellent for consumer engagement and terrible for transactional pricing. Before deploying a regression model, define the accuracy threshold required by the specific business use case — and honestly assess whether your model can meet it.

2. Average Accuracy Masks Tail Risk

A model with a 5 percent median error has, by definition, errors above 5 percent for half of its predictions. Some of those errors will be large. If the cost of large errors is disproportionate — as in iBuying, where a 15 percent overestimate wipes out all margin — then median or mean accuracy metrics are insufficient. You need to examine the error distribution, particularly the 90th and 99th percentile errors.
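Examining the error distribution rather than a single summary statistic is straightforward. The sketch below uses simulated heavy-tailed errors (a scaled Student-t distribution, invented for illustration — not Zillow data) to show how benign a median can look while the upper percentiles are alarming:

```python
import numpy as np

rng = np.random.default_rng(5)
# Simulated absolute percentage errors with a heavy right tail.
errors = np.abs(rng.standard_t(df=3, size=50_000)) * 0.04

print(f"median error: {np.percentile(errors, 50):.1%}")
print(f"90th pct:     {np.percentile(errors, 90):.1%}")
print(f"99th pct:     {np.percentile(errors, 99):.1%}")
```

In a business where a 15 percent overestimate wipes out the margin on a transaction, the 99th percentile line is the one that determines survival — and it can be many multiples of the median.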

3. Adverse Selection Defeats Average Accuracy

When the counterparty to your model-based decision can choose whether to transact, you will systematically see worse-than-average model performance. The people who accept your offer are the ones who know the offer is too generous. This dynamic applies to insurance pricing, lending, iBuying, and any market where one party has information the model lacks.

4. Models Trained on Bull Markets Fail in Bear Markets

Zillow's models were trained primarily during a period of rising home prices. They had limited experience with declining markets and no experience with rapid market corrections. When the regime changed, the models extrapolated the recent trend rather than recognizing the structural shift. Always ask: "What happens to this model if conditions reverse?"

5. Point Estimates Need Confidence Intervals

A regression model that outputs "$510,000" without a confidence range creates a false sense of precision. A model that outputs "$485,000 to $535,000 with 90 percent confidence" communicates the uncertainty inherent in any prediction. Post-iBuying, Zillow adopted confidence ranges — a practice that every business deploying regression models should follow.

6. The Model Is Not the Business

The Zestimate was an excellent model. Zillow Offers was a failed business. The gap between the two was not algorithmic — it was operational, strategic, and risk-management-related. A model is a tool. The business decisions made with that tool — how much capital to deploy, how much inventory to accumulate, how to manage adverse selection, how to respond to market changes — determine success or failure.


Discussion Questions

  1. Zillow's Zestimate was highly successful as a consumer engagement tool but insufficient for transactional pricing. How should organizations evaluate whether a model is "good enough" for a specific use case? What framework would you propose?

  2. The adverse selection problem meant that Zillow systematically overpaid for homes. Could Zillow have mitigated this with a different model, or is adverse selection a business problem that no model can solve?

  3. After the iBuying failure, Zillow added confidence ranges to the Zestimate. How should a business decide the appropriate confidence level to display — 80 percent? 90 percent? 95 percent? What are the trade-offs?

  4. Opendoor, Zillow's main iBuying competitor, survived the market downturn (though with significant losses). Research Opendoor's approach. What did they do differently from Zillow in terms of model design, risk management, or operational strategy?

  5. Zillow's CEO said the "unpredictability in forecasting home prices far exceeds what we anticipated." Is this a failure of the model, a failure of the business strategy, or a failure of risk management? Justify your answer.


Sources and Further Reading

  • Parker, W. (2021). "Zillow Quits Home-Flipping Business, Cites Inability to Forecast Prices." The Wall Street Journal, November 2, 2021.
  • Wiggins, C. (2021). "How Zillow's Homebuying Debacle Happened." Bloomberg Businessweek, November 8, 2021.
  • Humphries, S. (2019). "Building the Zestimate: Zillow's Recipe for Home Valuation." Zillow Research.
  • Kaggle. (2018). "Zillow Prize: Zillow's Home Value Prediction." kaggle.com.
  • Buchak, G., Grenadier, S., & Matvos, G. (2022). "iBuyers: Liquidity in Real Estate Markets." National Bureau of Economic Research Working Paper.
  • Zillow Group Inc. (2021). Q3 2021 Earnings Report and Shareholder Letter.
  • Caplin, A., & Leahy, J. (2011). "Trading Frictions and House Price Dynamics." Journal of Money, Credit and Banking, 43(7), 283-303.

This case study connects to Chapter 8's discussion of regression accuracy, the gap between training and deployment performance, overfitting to recent data, and the asymmetric costs of prediction errors. The Zillow story will be revisited in Chapter 11 (Model Evaluation and Selection) as an example of how evaluation metrics must align with business objectives.