Exercises: Chapter 27

Working with Geospatial Data: Maps, Spatial Joins, and Location-Based Analysis


Exercise 1: Geometry Basics and CRS (Conceptual)

You have a dataset of 10,000 customer addresses that have been geocoded to latitude and longitude. The data is stored in a CSV with columns customer_id, lat, lon, monthly_spend, churned.

a) Write the code to load this CSV into a GeoDataFrame with the correct CRS. What CRS should you use, and why?
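A minimal sketch of the loading step, with inline data standing in for the CSV described above (the file name in the comment is the hypothetical one from the exercise):

```python
import pandas as pd
import geopandas as gpd

# Stand-in for pd.read_csv('customers.csv') -- same columns as the exercise.
df = pd.DataFrame({
    'customer_id': [1, 2],
    'lat': [40.7128, 34.0522],
    'lon': [-74.0060, -118.2437],
})

# Geocoders return WGS84 lat/lon pairs, so EPSG:4326 is the correct CRS.
gdf = gpd.GeoDataFrame(
    df,
    geometry=gpd.points_from_xy(df['lon'], df['lat']),  # note: x=lon, y=lat
    crs='EPSG:4326',
)
```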

b) A colleague creates a GeoDataFrame from the same CSV but forgets to set the CRS. They then compute the distance between two customers using .distance(). The result is 0.0312. What are the units? Why is this number meaningless for business purposes?

c) To compute the actual distance in kilometers between two customers in New York City, you could either (1) reproject to EPSG:32618 (UTM Zone 18N) and use .distance(), or (2) use the haversine formula on the original WGS84 coordinates. Under what circumstances would you prefer each approach?

d) You receive a shapefile of US ZIP code boundaries. The .crs attribute shows EPSG:4269 (NAD83). Your customer data is in EPSG:4326 (WGS84). For most practical purposes, NAD83 and WGS84 are nearly identical (within ~2 meters). Should you still reproject before a spatial join? Why or why not?

e) A data engineer provides you with a dataset where the latitude column has values ranging from -122 to -71 and the longitude column has values ranging from 25 to 48. What went wrong? Write a fix.
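One possible fix, sketched with toy values (column names assumed from the description): values from -122 to -71 are US longitudes and 25 to 48 are US latitudes, so the columns were simply swapped.

```python
import pandas as pd

# The 'lat' column holds longitudes and vice versa -- swap by renaming.
df = pd.DataFrame({'lat': [-122.4, -74.0], 'lon': [37.8, 40.7]})
df = df.rename(columns={'lat': 'lon', 'lon': 'lat'})

# Sanity check: continental-US latitudes fall roughly in 24..50,
# longitudes in -125..-66.
assert df['lat'].between(24, 50).all()
assert df['lon'].between(-125, -66).all()
```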


Exercise 2: Spatial Joins (Conceptual + Code)

import numpy as np
import pandas as pd
import geopandas as gpd
from shapely.geometry import Point, box

np.random.seed(42)

# Service regions
regions = gpd.GeoDataFrame({
    'region': ['Northwest', 'Northeast', 'Southwest', 'Southeast'],
    'manager': ['Kim', 'Patel', 'Garcia', 'Williams'],
    'support_tier': ['premium', 'premium', 'standard', 'standard'],
}, geometry=[
    box(-125, 42, -104, 49),     # Northwest
    box(-104, 42, -67, 49),      # Northeast
    box(-125, 25, -104, 42),     # Southwest
    box(-104, 25, -67, 42),      # Southeast
], crs='EPSG:4326')

# Customer locations
n = 500
customers = gpd.GeoDataFrame({
    'customer_id': range(n),
    'monthly_spend': np.random.lognormal(5, 1, n).round(2),
    'churned': np.random.binomial(1, 0.18, n),
}, geometry=gpd.points_from_xy(
    np.random.uniform(-125, -67, n),
    np.random.uniform(25, 49, n),
), crs='EPSG:4326')

a) Perform a spatial join to assign each customer to a region. How many customers fall outside all four regions? (Hint: the regions do not cover Alaska, Hawaii, or the gaps between the boxes.)
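A sketch of the join in part (a), shown on a two-point toy version of the setup so the mechanics are visible (the real exercise uses the regions and customers frames above):

```python
import geopandas as gpd
from shapely.geometry import Point, box

regions = gpd.GeoDataFrame(
    {'region': ['Northeast']},
    geometry=[box(-104, 42, -67, 49)], crs='EPSG:4326')
pts = gpd.GeoDataFrame(
    {'customer_id': [0, 1]},
    geometry=[Point(-75, 45), Point(-120, 30)], crs='EPSG:4326')

# how='left' keeps unmatched customers; their region comes back as NaN,
# so counting NaNs answers "how many fall outside all regions?".
joined = gpd.sjoin(pts, regions, how='left', predicate='within')
n_outside = int(joined['region'].isna().sum())
```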

b) Calculate the churn rate and average monthly spend for each region. Which region has the highest churn rate?

c) A product manager asks: "What percentage of our premium-tier customers have churned?" Use the spatial join result to answer this. Could you have answered this without the spatial join?

d) You need to identify customers who are within 100 km of the boundary between the premium and standard support tiers (the line at latitude 42). Write a pipeline that: (1) reprojects to a meter-based CRS, (2) creates a buffer zone around the boundary, (3) identifies customers within that buffer. Why might these boundary customers be interesting for analysis?
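The pipeline in part (d) might look like the following sketch. EPSG:5070 (CONUS Albers, meter units) is one reasonable choice of projected CRS; the boundary line is densified so it still tracks the parallel after reprojection rather than cutting a straight chord.

```python
import geopandas as gpd
from shapely.geometry import LineString, Point

# (1) The tier boundary: the parallel at latitude 42, densified at
#     1-degree steps so reprojection follows the curve.
boundary = gpd.GeoSeries(
    [LineString([(x, 42) for x in range(-125, -66)])], crs='EPSG:4326')

# (2) Reproject to a meter-based CRS and buffer by 100 km.
buffer_zone = boundary.to_crs('EPSG:5070').buffer(100_000).iloc[0]

# (3) Flag customers inside the buffer (reprojected to the same CRS).
pts = gpd.GeoDataFrame(
    {'customer_id': [0, 1]},
    geometry=[Point(-90, 42.3), Point(-90, 30.0)], crs='EPSG:4326',
).to_crs('EPSG:5070')
pts['near_tier_boundary'] = pts.geometry.within(buffer_zone)
```

Buffering in EPSG:4326 would measure the 100_000 in degrees rather than meters, which is why step (2) reprojects before buffering.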

e) Suppose your spatial join in part (a) uses predicate='within'. If you changed it to predicate='intersects', would the results differ for point-in-polygon queries? When does the distinction matter?


Exercise 3: Building a Choropleth Map (Code)

import numpy as np
import pandas as pd
import geopandas as gpd

np.random.seed(42)

# Simulated sales data by US state
states = [
    'California', 'Texas', 'Florida', 'New York', 'Illinois',
    'Pennsylvania', 'Ohio', 'Georgia', 'North Carolina', 'Michigan',
    'New Jersey', 'Virginia', 'Washington', 'Arizona', 'Massachusetts',
    'Tennessee', 'Indiana', 'Missouri', 'Maryland', 'Colorado',
    'Wisconsin', 'Minnesota', 'South Carolina', 'Alabama', 'Louisiana',
]

sales_data = pd.DataFrame({
    'state': states,
    'total_sales': np.random.lognormal(15, 1.2, len(states)).round(0),
    'n_customers': np.random.randint(500, 15000, len(states)),
    'avg_order_value': np.random.uniform(45, 180, len(states)).round(2),
    'return_rate': np.random.uniform(0.03, 0.18, len(states)).round(3),
})
sales_data['sales_per_customer'] = (
    sales_data['total_sales'] / sales_data['n_customers']
).round(2)

a) Create a folium choropleth map showing return_rate by state. Use the 'YlOrRd' color scale. Save it to an HTML file. What do you observe about the geographic pattern of return rates?

b) Create a second choropleth showing sales_per_customer. Use the 'Blues' color scale. Does the geographic pattern differ from the return rate map?

c) The folium choropleth requires a GeoJSON file with state boundaries. The standard source is https://raw.githubusercontent.com/python-visualization/folium/main/tests/us-states.json. The key_on parameter tells folium which GeoJSON property to match against the state-name column in your data. What happens if a state name in your DataFrame does not match the GeoJSON property exactly? How would you debug this?
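For part (c), note that a mismatched name raises no error: the state simply gets no data and renders in the default "missing" color. A quick debugging sketch (the name sets here are illustrative):

```python
# Diff the two name sets; anything in one but not the other won't be colored.
geojson_names = {'California', 'Texas', 'New York'}  # from the GeoJSON 'name' property
df_names = {'California', 'Texas', 'NY'}             # from sales_data['state']

unmatched_in_df = df_names - geojson_names   # rows that color no state
unmatched_in_geo = geojson_names - df_names  # states that stay gray
```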

d) Create a static matplotlib choropleth using geopandas .plot(). Compare the workflow to folium. When would you prefer each approach?

e) Add a layer of circle markers to the folium map showing n_customers at each state's approximate centroid. Scale the circle radius proportionally to customer count. (Hint: use approximate centroid coordinates --- exact centroids require actual state boundary polygons.)


Exercise 4: Location-Based Feature Engineering (Code)

import numpy as np
import pandas as pd

np.random.seed(42)

# ShopSmart fulfillment centers
fulfillment_centers = pd.DataFrame({
    'center_id': ['FC_NYC', 'FC_LA', 'FC_CHI', 'FC_DAL', 'FC_ATL'],
    'lat': [40.713, 34.052, 41.878, 32.777, 33.749],
    'lon': [-74.006, -118.244, -87.630, -96.797, -84.388],
    'max_daily_shipments': [8000, 12000, 6000, 5000, 7000],
})

# Customer orders
n_orders = 2000
customer_lats = np.random.uniform(25, 48, n_orders)
customer_lons = np.random.uniform(-125, -67, n_orders)

orders = pd.DataFrame({
    'order_id': range(n_orders),
    'customer_lat': customer_lats.round(4),
    'customer_lon': customer_lons.round(4),
    'order_value': np.random.lognormal(4, 0.8, n_orders).round(2),
    'delivery_days': np.random.exponential(3.5, n_orders).round(1).clip(1, 14),
    'satisfaction': np.random.randint(1, 6, n_orders),
})

a) Write a haversine_vectorized function and use it to compute dist_nearest_fc_km and nearest_fc for each order. What are the mean and median distances to the nearest fulfillment center?
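One way part (a) might be sketched, using broadcasting so parts (a) and (c) can share a single pairwise distance matrix. The toy order coordinates are made up; the center coordinates are the NYC and LA rows from the setup above.

```python
import numpy as np

def haversine_vectorized(lat1, lon1, lat2, lon2):
    """Great-circle distance in km; accepts scalars or broadcastable arrays."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * np.arcsin(np.sqrt(a))  # mean Earth radius ~6371 km

# Two toy orders against the FC_NYC and FC_LA centers from the setup.
order_lat = np.array([41.0, 35.0])
order_lon = np.array([-75.0, -118.0])
fc_lat = np.array([40.713, 34.052])
fc_lon = np.array([-74.006, -118.244])

# Broadcast to an (n_orders, n_centers) matrix of pairwise distances.
d = haversine_vectorized(order_lat[:, None], order_lon[:, None],
                         fc_lat[None, :], fc_lon[None, :])
dist_nearest_fc_km = d.min(axis=1)
nearest_fc_idx = d.argmin(axis=1)               # index into fulfillment_centers
fc_count_within_500km = (d <= 500).sum(axis=1)  # reusable for part (c)
```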

b) Create distance bins (<200km, 200-500km, 500-1000km, >1000km) and compute the average delivery time and satisfaction score for each bin. Is there a clear relationship?
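The binning in part (b) is a pd.cut followed by a groupby; a minimal sketch with made-up distances:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'dist_km': [50, 350, 800, 1500, 120],
                   'delivery_days': [2.0, 3.0, 4.5, 6.0, 2.5]})
df['dist_bin'] = pd.cut(
    df['dist_km'],
    bins=[0, 200, 500, 1000, np.inf],
    labels=['<200km', '200-500km', '500-1000km', '>1000km'])

# observed=True keeps only bins that actually occur in the data.
avg_delivery = df.groupby('dist_bin', observed=True)['delivery_days'].mean()
```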

c) Engineer a fc_count_within_500km feature: for each order, count how many fulfillment centers are within 500 km. What is the distribution of this feature? How would you expect it to relate to delivery speed?

d) Build a simple model (gradient boosting) to predict satisfaction using three feature sets: (1) order_value and delivery_days only, (2) add dist_nearest_fc_km, (3) add all spatial features (dist_nearest_fc_km, fc_count_within_500km, nearest_fc encoded). Report cross-validated MAE for each. Does geospatial feature engineering improve the model?

e) A business stakeholder asks: "Where should we build our sixth fulfillment center to minimize average customer distance?" Propose an approach using the data above. (Hint: consider grid search over candidate locations.)


Exercise 5: Projection Pitfalls (Conceptual + Code)

a) You compute the area of Texas using polygon.area on a GeoDataFrame in EPSG:4326. The result is 55.2. A colleague computes it after reprojecting to EPSG:5070 (Albers Equal Area Conic) and gets 6.95e11. Explain both numbers. Which is correct for reporting the area of Texas in square meters?

b) You create a 10 km buffer around a point in San Francisco using EPSG:4326 (i.e., point.buffer(10000)). Describe what happens. Then write the correct code to create a 10 km buffer using an appropriate projected CRS.
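The corrected buffer for part (b), sketched with EPSG:32610 (UTM Zone 10N covers San Francisco; any local meter-based CRS would do):

```python
import geopandas as gpd
from shapely.geometry import Point

sf = gpd.GeoSeries([Point(-122.4194, 37.7749)], crs='EPSG:4326')

# Reproject to meters, buffer 10 km, then (optionally) back for web mapping.
buffer_10km = sf.to_crs('EPSG:32610').buffer(10_000)
buffer_wgs84 = buffer_10km.to_crs('EPSG:4326')
```

Buffer units are always the CRS units, which is why point.buffer(10000) in EPSG:4326 produces a nonsensical 10,000-degree "circle".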

c) The haversine formula assumes a spherical Earth. The Vincenty formula assumes an ellipsoidal Earth. For data science applications (not surveying), does the difference matter? At what scale of analysis would you need the Vincenty formula?

d) You are analyzing delivery routes across all of Europe. Which projected CRS would you choose, and why? (Hint: consider EPSG:3035, ETRS89/LAEA Europe.)

e) A dataset contains coordinates in EPSG:3857 (Web Mercator), which is what Google Maps and OpenStreetMap use internally. Why is Web Mercator unsuitable for area calculations? At what latitude does the distortion become extreme?


Exercise 6: End-to-End Geospatial Pipeline (Code)

Build a complete geospatial analysis pipeline for StreamFlow's churn prediction:

import numpy as np
import pandas as pd

np.random.seed(42)

# StreamFlow customer data
n = 3000
states = ['CA', 'TX', 'NY', 'FL', 'IL', 'PA', 'OH', 'GA', 'NC', 'WA',
          'NJ', 'VA', 'AZ', 'MA', 'CO', 'TN', 'MI', 'MO', 'MD', 'IN']

state_churn_rates = {
    'CA': 0.10, 'TX': 0.18, 'NY': 0.11, 'FL': 0.22, 'IL': 0.15,
    'PA': 0.13, 'OH': 0.20, 'GA': 0.19, 'NC': 0.16, 'WA': 0.08,
    'NJ': 0.12, 'VA': 0.14, 'AZ': 0.24, 'MA': 0.09, 'CO': 0.11,
    'TN': 0.19, 'MI': 0.21, 'MO': 0.17, 'MD': 0.13, 'IN': 0.20,
}

customers = pd.DataFrame({
    'customer_id': range(n),
    'state': np.random.choice(states, n),
    'plan': np.random.choice(['basic', 'pro', 'enterprise'], n, p=[0.5, 0.35, 0.15]),
    'tenure_months': np.random.exponential(18, n).astype(int).clip(1, 60),
    'monthly_revenue': np.random.lognormal(4.5, 0.8, n).round(2),
    'support_tickets': np.random.poisson(2, n),
})

customers['churned'] = [
    np.random.binomial(1, state_churn_rates[s]) for s in customers['state']
]

a) Compute state-level churn rates. Identify the top 5 highest-churn states.

b) Engineer a state_churn_rate feature by computing leave-one-out mean encoding (to avoid leakage). Explain why naive target encoding leaks.
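Part (b)'s leave-one-out encoding can be computed vectorized from group sums and counts; a sketch on a five-row toy frame:

```python
import pandas as pd

df = pd.DataFrame({'state': ['CA', 'CA', 'CA', 'TX', 'TX'],
                   'churned': [1, 0, 0, 1, 1]})

grp = df.groupby('state')['churned']
# For each row, exclude its own label from the group mean:
# (group sum - own label) / (group count - 1).
# Note: single-row groups divide by zero (NaN); in practice fall back
# to the global churn rate for those.
df['state_churn_loo'] = ((grp.transform('sum') - df['churned'])
                         / (grp.transform('count') - 1))
```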

c) Add two more spatial features of your design. Justify why each should be predictive of churn.

d) Train a GradientBoostingClassifier with and without the spatial features. Report 5-fold cross-validated AUC. How much lift do the spatial features provide?

e) Use SHAP or feature importances to identify the most important spatial feature. Write a one-paragraph summary of your findings for a non-technical stakeholder.


Exercise 7: Map Design for Stakeholders (Applied)

You have completed the StreamFlow churn analysis and need to present the results to the VP of Customer Success.

a) Design a choropleth map specification: which variable to display, which color scale, how many bins, what legend labels. Justify each choice for a non-technical audience.

b) The VP asks: "Can you overlay the locations of our data centers on the churn map?" Describe how you would add this layer. What insight might this combined view reveal?

c) The VP notices that Arizona and Florida have high churn and asks: "Is it the heat?" Write three alternative hypotheses that could explain geographic churn variation. For each, describe what data you would need to test it.

d) A colleague suggests using a hex-bin map instead of a state choropleth. Describe the tradeoff. Under what circumstances would hex-bins be more informative?

e) Create a mock dashboard layout (describe it in text --- bullet points with sections and chart types) that combines the choropleth with supporting charts (bar chart of churn by state, scatter plot of distance vs. satisfaction, time series of churn by region). What story does this dashboard tell?


Return to the chapter for full context.