In This Chapter
- Maps, Spatial Joins, and Location-Based Analysis
- Why Geospatial Data Is Different
- Part 1: Geospatial Data Fundamentals
- Part 2: Coordinate Reference Systems and Projections
- Part 3: Spatial Joins and Point-in-Polygon
- Part 4: Creating Maps with Folium
- Part 5: Engineering Location-Based Features
- Part 6: Geocoding and Real-World Data Challenges
- Part 7: Putting It All Together --- The ShopSmart Delivery Analysis
- Chapter Summary
Chapter 27: Working with Geospatial Data
Maps, Spatial Joins, and Location-Based Analysis
Learning Objectives
By the end of this chapter, you will be able to:
- Work with geospatial data using geopandas (GeoDataFrames, geometry columns)
- Perform spatial joins and point-in-polygon queries
- Create choropleth maps and point maps with folium
- Engineer location-based features (distance to nearest X, count within radius)
- Handle coordinate reference systems (CRS) and projections
Why Geospatial Data Is Different
Core Principle --- Most data scientists encounter geospatial data. Few are prepared for it. The moment your dataset includes latitude and longitude columns, ZIP codes, addresses, county names, or any other location identifier, you have geospatial data --- and treating those columns as ordinary numerical or categorical features is a mistake. Latitude and longitude are not independent features. A one-degree change in latitude near the equator means something different from a one-degree change near the poles. ZIP codes are not ordinal. Distance between two points is not Euclidean on a sphere. This chapter gives you the practical toolkit to handle geospatial data correctly: loading shapefiles and GeoJSON, building GeoDataFrames, performing spatial joins, creating maps, engineering location-based features, and avoiding the projection errors that silently corrupt analysis.
This is not a GIS course. You will not learn cartographic theory, remote sensing, or spatial statistics at the level a geographer would. What you will learn is the 20% of geospatial tools that solve 80% of the location problems a data scientist encounters: "Which region has the highest churn?" "How far is each customer from the nearest warehouse?" "Do delivery times vary by geography?" "Is there a spatial pattern in our sales data?"
The tools are geopandas for spatial data manipulation, shapely for geometry operations, folium for interactive maps, and scikit-learn for integrating spatial features into machine learning models. If you can write pandas code, you can write geopandas code. The API is nearly identical --- with one critical addition: a geometry column.
Part 1: Geospatial Data Fundamentals
What Makes Data "Geospatial"
Geospatial data is any data tied to a location on the Earth's surface. It comes in two forms:
Vector data represents discrete features as points, lines, or polygons. A store location is a point. A road is a linestring. A county boundary is a polygon. Vector data is what you will work with 90% of the time as a data scientist.
Raster data represents continuous surfaces as grids of cells (pixels). Satellite imagery, elevation maps, temperature grids. Raster data is the domain of remote sensing and environmental science. We will not cover it here.
The three fundamental geometry types in vector data:
from shapely.geometry import Point, LineString, Polygon
# A point: a single location (longitude, latitude)
store = Point(-73.9857, 40.7484) # Empire State Building
print(f"Store location: {store}")
print(f"Type: {store.geom_type}")
# Store location: POINT (-73.9857 40.7484)
# Type: Point
# A linestring: a sequence of connected points (a path)
delivery_route = LineString([
(-73.9857, 40.7484), # Start: Empire State Building
(-73.9712, 40.7831), # Waypoint: Central Park South
(-73.9680, 40.7851), # End: Central Park
])
print(f"Route length (degrees): {delivery_route.length:.4f}")
# Route length (degrees): 0.0414
# A polygon: a closed shape (a region)
service_zone = Polygon([
(-74.02, 40.70),
(-73.95, 40.70),
(-73.95, 40.80),
(-74.02, 40.80),
(-74.02, 40.70), # Must close the ring
])
print(f"Zone area (sq degrees): {service_zone.area:.4f}")
# Zone area (sq degrees): 0.0070
Warning --- The numbers above are in degrees, not meters or kilometers. A length of 0.04 degrees means nothing useful without converting to real-world units. This is the projection problem, and we will solve it shortly.
The GeoDataFrame: Pandas with Geometry
Geopandas extends pandas with a geometry column that holds shapely geometry objects. Every other column works exactly like a regular DataFrame.
import numpy as np
import pandas as pd
import geopandas as gpd
from shapely.geometry import Point
np.random.seed(42)
# Create a regular DataFrame with lat/lon columns
customers = pd.DataFrame({
'customer_id': range(1, 11),
'name': [f'Customer_{i}' for i in range(1, 11)],
'lat': np.random.uniform(40.70, 40.80, 10),
'lon': np.random.uniform(-74.02, -73.95, 10),
'monthly_spend': np.random.uniform(50, 500, 10).round(2),
})
# Convert to GeoDataFrame by creating Point geometries
geometry = [Point(lon, lat) for lon, lat in zip(customers['lon'], customers['lat'])]
customers_gdf = gpd.GeoDataFrame(customers, geometry=geometry, crs='EPSG:4326')
print(customers_gdf.head())
print(f"\nCRS: {customers_gdf.crs}")
print(f"Geometry column: {customers_gdf.geometry.name}")
print(f"Geometry type: {customers_gdf.geom_type.unique()}")
Three things happened in that code:
- We built a list of Point(lon, lat) objects. Note the order: longitude first, latitude second. This is the standard (x, y) order in geospatial tools, even though humans say "lat, lon." Getting this backwards is the single most common geospatial bug.
- We passed crs='EPSG:4326' to declare the coordinate reference system. EPSG:4326 is WGS84 --- the system used by GPS, Google Maps, and most web APIs. Latitude and longitude in degrees.
- The result is a GeoDataFrame with all the usual pandas methods plus spatial capabilities.
Practical Tip --- If your data has separate latitude and longitude columns, use gpd.points_from_xy(df['longitude'], df['latitude']) as a shortcut. It is faster than a list comprehension for large datasets.
# Faster alternative for creating point geometries
customers_gdf2 = gpd.GeoDataFrame(
customers,
geometry=gpd.points_from_xy(customers['lon'], customers['lat']),
crs='EPSG:4326'
)
Loading Geospatial Files
Real geospatial data comes in standard file formats. The three you will encounter most often:
# Shapefiles (.shp) --- the legacy format, still widely used
# A shapefile is actually 3-6 files (.shp, .shx, .dbf, .prj, ...)
# geopandas reads the .shp and finds the rest automatically
# states = gpd.read_file("us_states.shp")
# GeoJSON --- the modern format, human-readable, web-friendly
# Single file, JSON structure with geometry and properties
# states = gpd.read_file("us_states.geojson")
# GeoPackage (.gpkg) --- the recommended format
# Single file, SQLite-based, supports multiple layers
# states = gpd.read_file("us_states.gpkg")
# For this chapter, we'll use built-in datasets
# naturalearth_lowres ships with older geopandas versions (the built-in
# datasets were removed in geopandas 1.0; there, download the file from
# Natural Earth and pass its path to gpd.read_file instead)
world = gpd.read_file(gpd.datasets.get_path('naturalearth_lowres'))
print(world.columns.tolist())
# ['pop_est', 'continent', 'name', 'iso_a3', 'gdp_md_est', 'geometry']
print(world.head(3))
print(f"\nCRS: {world.crs}")
print(f"Number of countries: {len(world)}")
Key Concept --- Every geospatial file carries its CRS metadata. When you load a shapefile or GeoJSON, geopandas reads the CRS automatically. When you create a GeoDataFrame from lat/lon columns, you must specify the CRS yourself. If you forget, geopandas assumes no CRS, and spatial operations will produce garbage.
Part 2: Coordinate Reference Systems and Projections
The Problem: The Earth Is Not Flat
The Earth is roughly a sphere (technically an oblate spheroid). A coordinate reference system defines how to flatten that sphere onto a 2D coordinate plane. Every CRS involves tradeoffs --- you cannot preserve area, shape, distance, and direction simultaneously.
The two CRS types you need to know:
Geographic CRS (unprojected): Coordinates are in degrees of latitude and longitude on the 3D ellipsoid. EPSG:4326 (WGS84) is the standard. Use this for storing data and for web maps.
Projected CRS: Coordinates are in linear units (meters, feet) on a flat plane. Use this for measuring distances, areas, and performing spatial analysis.
# Demonstrate why projections matter
from shapely.geometry import Point
# Two points in New York City
point_a = Point(-73.9857, 40.7484) # Midtown
point_b = Point(-74.0060, 40.7128) # Downtown
# "Distance" in degrees --- meaningless
dist_degrees = point_a.distance(point_b)
print(f"Distance in degrees: {dist_degrees:.4f}")
# Distance in degrees: 0.0410
# To get real distance, we need to either:
# 1. Project to a local CRS and measure in meters
# 2. Use the haversine formula for geodesic distance
# Method 1: Project to UTM Zone 18N (covers NYC)
import geopandas as gpd
gdf = gpd.GeoDataFrame(
{'name': ['Midtown', 'Downtown']},
geometry=[point_a, point_b],
crs='EPSG:4326'
)
# Reproject to EPSG:32618 (UTM Zone 18N, units in meters)
gdf_projected = gdf.to_crs('EPSG:32618')
dist_meters = gdf_projected.geometry.iloc[0].distance(gdf_projected.geometry.iloc[1])
print(f"Distance in meters: {dist_meters:.0f}")
# Distance in meters: ~4310
# Method 2: Haversine formula (no projection needed)
from math import radians, sin, cos, sqrt, atan2
def haversine(lon1, lat1, lon2, lat2):
    """Calculate great-circle distance between two points in km."""
    R = 6371  # Earth's radius in km
    lon1, lat1, lon2, lat2 = map(radians, [lon1, lat1, lon2, lat2])
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = sin(dlat / 2)**2 + cos(lat1) * cos(lat2) * sin(dlon / 2)**2
    return R * 2 * atan2(sqrt(a), sqrt(1 - a))
dist_km = haversine(-73.9857, 40.7484, -74.0060, 40.7128)
print(f"Haversine distance: {dist_km:.2f} km")
# Haversine distance: 4.31 km
CRS Rules for Data Scientists
The practical rules:
- Store data in EPSG:4326 (WGS84). This is what GPS devices, web APIs, and geocoding services use. Latitude and longitude in degrees.
- Project before measuring. Any time you calculate distances, areas, or buffers, reproject to a local projected CRS first. UTM zones work well for regional analysis. For the continental US, EPSG:5070 (Albers Equal Area) preserves area.
- Match CRS before combining datasets. If you have two GeoDataFrames in different CRS, reproject one to match the other before performing spatial joins or overlays.
- Always check the CRS after loading. Some files lack CRS metadata. Some have incorrect metadata. A quick sanity check: plot the data and verify it appears where you expect.
# CRS matching example
import geopandas as gpd
# Imagine two datasets in different CRS
gdf_wgs84 = gpd.GeoDataFrame(
{'name': ['NYC']},
geometry=[Point(-73.9857, 40.7484)],
crs='EPSG:4326'
)
gdf_utm = gdf_wgs84.to_crs('EPSG:32618')
# This would produce wrong results:
# gdf_wgs84.sjoin(gdf_utm) # CRS mismatch!
# Always reproject first:
gdf_utm_aligned = gdf_wgs84.to_crs(gdf_utm.crs)
print(f"CRS match: {gdf_utm.crs == gdf_utm_aligned.crs}")
# CRS match: True
Common Mistake --- Calculating distance using point.distance(other_point) on unprojected (EPSG:4326) data. The result is in degrees, not meters, and the conversion factor varies by latitude: 0.01 degrees of longitude is about 1.1 km at the equator but about 0.7 km at 50 degrees latitude. Always project first or use the haversine formula.
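The shrinking degree is easy to verify with a back-of-the-envelope check, using only the cosine rule for longitude spacing (no geospatial libraries needed):

```python
from math import cos, radians

# One degree of longitude spans ~111.32 km at the equator and shrinks
# with the cosine of the latitude; latitude degrees stay ~111 km everywhere
KM_PER_DEGREE_AT_EQUATOR = 111.32

def lon_degrees_to_km(degrees, latitude):
    """Convert a longitude span in degrees to kilometers at a given latitude."""
    return degrees * KM_PER_DEGREE_AT_EQUATOR * cos(radians(latitude))

print(f"0.01 deg lon at the equator: {lon_degrees_to_km(0.01, 0):.2f} km")   # ~1.11 km
print(f"0.01 deg lon at 50 degrees:  {lon_degrees_to_km(0.01, 50):.2f} km")  # ~0.72 km
```

This is exactly why a Euclidean distance on raw lat/lon columns over-weights north-south separation relative to east-west separation everywhere except the equator.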
Part 3: Spatial Joins and Point-in-Polygon
The Spatial Join: SQL JOIN Meets Geography
A spatial join connects two datasets based on their spatial relationship rather than a shared key. The most common use case: "For each point, which polygon does it fall inside?"
import numpy as np
import pandas as pd
import geopandas as gpd
from shapely.geometry import Point, Polygon, box
np.random.seed(42)
# Create regions (polygons)
regions = gpd.GeoDataFrame({
'region': ['North', 'South', 'East', 'West'],
'region_manager': ['Alice', 'Bob', 'Carol', 'Dave'],
}, geometry=[
box(-74.05, 40.76, -73.97, 40.82), # North
box(-74.05, 40.70, -73.97, 40.76), # South
box(-73.97, 40.70, -73.90, 40.82), # East
box(-74.12, 40.70, -74.05, 40.82), # West
], crs='EPSG:4326')
# Create customer locations (points)
n_customers = 200
customers = gpd.GeoDataFrame({
'customer_id': range(n_customers),
'monthly_spend': np.random.exponential(150, n_customers).round(2),
'churned': np.random.binomial(1, 0.15, n_customers),
}, geometry=gpd.points_from_xy(
np.random.uniform(-74.12, -73.90, n_customers),
np.random.uniform(40.70, 40.82, n_customers),
), crs='EPSG:4326')
# Spatial join: assign each customer to a region
customers_with_region = gpd.sjoin(
customers, # left: points
regions, # right: polygons
how='left', # keep all customers, even if outside all regions
predicate='within' # point within polygon
)
print(customers_with_region[['customer_id', 'monthly_spend', 'region']].head(10))
print(f"\nCustomers per region:")
print(customers_with_region['region'].value_counts())
# Aggregate: churn rate by region
region_churn = (
customers_with_region
.groupby('region')
.agg(
n_customers=('customer_id', 'count'),
churn_rate=('churned', 'mean'),
avg_spend=('monthly_spend', 'mean'),
)
.round(3)
)
print(f"\n{region_churn}")
How Spatial Joins Work
The predicate parameter defines the spatial relationship:
- 'within': left geometry is entirely inside right geometry (point in polygon)
- 'intersects': geometries share any space (overlapping polygons)
- 'contains': right geometry is entirely inside left geometry (inverse of within)
Under the hood, geopandas uses a spatial index (R-tree) to avoid comparing every point against every polygon. Without the spatial index, a join of 100,000 points against 3,000 polygons would require 300 million comparisons. With the spatial index, it runs in seconds.
# Point-in-polygon: the manual way (for understanding)
from shapely.geometry import Point
point = Point(-74.00, 40.75)
north_region = regions[regions['region'] == 'North'].geometry.iloc[0]
south_region = regions[regions['region'] == 'South'].geometry.iloc[0]
print(f"Point in North region: {north_region.contains(point)}")
print(f"Point in South region: {south_region.contains(point)}")
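To make the spatial-index idea concrete, here is a minimal sketch using shapely 2's STRtree directly (toy unit-grid boxes, not the regions above): the tree narrows the search to candidate polygons instead of scanning all of them, which is the same machinery geopandas uses inside sjoin.

```python
from shapely import STRtree
from shapely.geometry import Point, box

# Three toy regions on a unit grid
region_geoms = [box(0, 0, 1, 1), box(1, 0, 2, 1), box(0, 1, 1, 2)]
tree = STRtree(region_geoms)

# Query the index: returns the indices of tree geometries whose
# relationship with the input satisfies the predicate
pt = Point(0.5, 0.5)
hits = tree.query(pt, predicate="within")  # pt within which polygons?
print(hits)  # index 0: the box from (0, 0) to (1, 1)
```

Without the index, checking pt against every polygon is O(n) per point; the R-tree prunes candidates by bounding box first, which is what makes joins of hundreds of thousands of points tractable.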
Practical Tip --- When your spatial join drops rows unexpectedly, check three things: (1) CRS mismatch between the two GeoDataFrames, (2) points outside all polygons (use how='left' to keep them as NaN), (3) longitude/latitude swapped in one dataset.
Spatial Joins at Scale: The StreamFlow Example
StreamFlow, the SaaS company from Chapter 1, wants to understand regional churn patterns. They have customer locations and US state boundaries.
import numpy as np
import pandas as pd
import geopandas as gpd
from shapely.geometry import Point
np.random.seed(42)
# Load US boundaries (naturalearth_lowres has only country outlines;
# for real state boundaries: gpd.read_file("us_states.geojson"))
us_states = gpd.read_file(gpd.datasets.get_path('naturalearth_lowres'))
us_states = us_states[us_states['name'] == 'United States of America']
# For a more realistic example, simulate state-level data
state_names = [
'California', 'Texas', 'New York', 'Florida', 'Illinois',
'Pennsylvania', 'Ohio', 'Georgia', 'North Carolina', 'Michigan',
'New Jersey', 'Virginia', 'Washington', 'Arizona', 'Massachusetts',
'Tennessee', 'Indiana', 'Missouri', 'Maryland', 'Colorado'
]
# Approximate state centroids (lon, lat)
state_coords = {
'California': (-119.4, 36.8), 'Texas': (-99.9, 31.9),
'New York': (-75.5, 43.0), 'Florida': (-81.5, 27.6),
'Illinois': (-89.4, 40.6), 'Pennsylvania': (-77.2, 41.2),
'Ohio': (-82.9, 40.4), 'Georgia': (-83.5, 32.9),
'North Carolina': (-79.0, 35.8), 'Michigan': (-84.5, 44.3),
'New Jersey': (-74.4, 40.1), 'Virginia': (-78.2, 37.8),
'Washington': (-120.7, 47.8), 'Arizona': (-111.1, 34.0),
'Massachusetts': (-71.4, 42.4), 'Tennessee': (-86.6, 35.5),
'Indiana': (-86.1, 40.3), 'Missouri': (-91.8, 38.6),
'Maryland': (-76.6, 39.0), 'Colorado': (-105.3, 39.6),
}
# Simulate StreamFlow customers across states
n_customers = 5000
customer_states = np.random.choice(state_names, n_customers, p=[
0.15, 0.12, 0.10, 0.09, 0.07, # CA, TX, NY, FL, IL
0.06, 0.05, 0.05, 0.04, 0.04, # PA, OH, GA, NC, MI
0.04, 0.03, 0.03, 0.03, 0.03, # NJ, VA, WA, AZ, MA
0.02, 0.02, 0.01, 0.01, 0.01, # TN, IN, MO, MD, CO
])
# Base churn rate with regional variation
state_churn_base = {
'California': 0.12, 'Texas': 0.18, 'New York': 0.10,
'Florida': 0.22, 'Illinois': 0.16, 'Pennsylvania': 0.14,
'Ohio': 0.20, 'Georgia': 0.19, 'North Carolina': 0.17,
'Michigan': 0.21, 'New Jersey': 0.11, 'Virginia': 0.13,
'Washington': 0.09, 'Arizona': 0.23, 'Massachusetts': 0.10,
'Tennessee': 0.19, 'Indiana': 0.20, 'Missouri': 0.18,
'Maryland': 0.12, 'Colorado': 0.11,
}
customers_df = pd.DataFrame({
'customer_id': range(n_customers),
'state': customer_states,
'plan_tier': np.random.choice(['basic', 'pro', 'enterprise'], n_customers, p=[0.5, 0.35, 0.15]),
'tenure_months': np.random.exponential(18, n_customers).astype(int).clip(1, 72),
'monthly_revenue': np.random.lognormal(4.5, 0.8, n_customers).round(2),
})
customers_df['churned'] = [
np.random.binomial(1, state_churn_base[s])
for s in customers_df['state']
]
# Aggregate by state
state_summary = (
customers_df
.groupby('state')
.agg(
n_customers=('customer_id', 'count'),
churn_rate=('churned', 'mean'),
avg_revenue=('monthly_revenue', 'mean'),
avg_tenure=('tenure_months', 'mean'),
)
.round(3)
.sort_values('churn_rate', ascending=False)
)
print("StreamFlow churn by state (top 10):")
print(state_summary.head(10))
The pattern is immediately visible: certain states have substantially higher churn rates. But is this geography, or is it confounded by plan mix or tenure? We will return to this question in the case study.
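A quick first probe of the confounding question is to compare plan mix across states and churn within each (state, plan) cell. A minimal sketch with simulated data (the column names mirror the code above; the two states are arbitrary examples):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
demo = pd.DataFrame({
    'state': rng.choice(['Florida', 'Washington'], 1000),
    'plan_tier': rng.choice(['basic', 'pro', 'enterprise'], 1000, p=[0.5, 0.35, 0.15]),
    'churned': rng.binomial(1, 0.15, 1000),
})

# If plan mix differs sharply between high- and low-churn states,
# the "geographic" pattern may really be a plan-tier pattern
plan_mix = pd.crosstab(demo['state'], demo['plan_tier'], normalize='index')
print(plan_mix.round(3))

# Churn within each (state, plan) cell: if the gap between states
# persists inside every tier, geography is doing real work
churn_cells = demo.groupby(['state', 'plan_tier'])['churned'].mean()
print(churn_cells.round(3))
```

This is not a causal analysis, but it is enough to decide whether the state-level map deserves further investigation or is an artifact of customer composition.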
Part 4: Creating Maps with Folium
Why Maps Matter for Data Science
A table of churn rates by state is useful. A choropleth map of churn rates by state is compelling. Maps exploit the human visual system's ability to detect spatial patterns --- clusters, gradients, outliers --- that are invisible in tabular data. When you present to stakeholders, a map communicates "the Southeast has a churn problem" in two seconds. A sorted table takes thirty.
Choropleth Maps
A choropleth map colors regions by a variable. Light color means low value, dark means high.
import folium
import json
# Create a choropleth map of churn rates
# First, build the GeoJSON (simplified state boundaries for demo)
# In practice, you would load actual state boundary GeoJSON
# Using folium's built-in choropleth with state data
m = folium.Map(location=[39.8, -98.5], zoom_start=4, tiles='cartodbpositron')
# folium.Choropleth requires:
# 1. geo_data: GeoJSON with region boundaries
# 2. data: pandas DataFrame or Series with values to map
# 3. columns: [key_column, value_column]
# 4. key_on: property in GeoJSON that matches key_column
# Example with US states (using folium's built-in US states URL)
state_churn_df = state_summary.reset_index()[['state', 'churn_rate']]
us_states_url = 'https://raw.githubusercontent.com/python-visualization/folium/main/tests/us-states.json'
folium.Choropleth(
geo_data=us_states_url,
data=state_churn_df,
columns=['state', 'churn_rate'],
key_on='feature.properties.name',
fill_color='YlOrRd', # Yellow-Orange-Red color scale
fill_opacity=0.7,
line_opacity=0.2,
legend_name='Churn Rate',
nan_fill_color='white',
).add_to(m)
# Save to HTML (viewable in any browser)
m.save('streamflow_churn_choropleth.html')
print("Map saved to streamflow_churn_choropleth.html")
Practical Tip --- Folium creates interactive HTML maps. In a Jupyter notebook, the map renders inline. In a production pipeline, save to HTML and embed in a dashboard or email. For static images (reports, PDFs), use matplotlib with geopandas' built-in .plot() method instead.
Static Maps with Geopandas
When you need a quick plot for exploration or a static image for a report:
import matplotlib.pyplot as plt
import geopandas as gpd
# Using the world dataset for demonstration
world = gpd.read_file(gpd.datasets.get_path('naturalearth_lowres'))
fig, axes = plt.subplots(1, 2, figsize=(16, 6))
# Left: GDP per capita choropleth
world['gdp_per_cap'] = world['gdp_md_est'] / world['pop_est'] * 1e6
world.plot(
column='gdp_per_cap',
cmap='viridis',
legend=True,
legend_kwds={'label': 'GDP per Capita ($)', 'shrink': 0.5},
ax=axes[0],
missing_kwds={'color': 'lightgrey'},
)
axes[0].set_title('GDP per Capita by Country')
axes[0].set_axis_off()
# Right: Population choropleth
world.plot(
column='pop_est',
cmap='YlOrRd',
scheme='quantiles', # Equal-count bins
legend=True,
ax=axes[1],
missing_kwds={'color': 'lightgrey'},
)
axes[1].set_title('Population by Country')
axes[1].set_axis_off()
plt.tight_layout()
plt.savefig('world_choropleths.png', dpi=150, bbox_inches='tight')
plt.show()
Point Maps
When your data is points (customer locations, store locations, events), use markers or circle markers:
import folium
import numpy as np
import pandas as pd
np.random.seed(42)
# ShopSmart warehouse and customer locations
warehouses = pd.DataFrame({
'name': ['NYC Warehouse', 'LA Warehouse', 'Chicago Warehouse', 'Dallas Warehouse'],
'lat': [40.7128, 34.0522, 41.8781, 32.7767],
'lon': [-74.0060, -118.2437, -87.6298, -96.7970],
'capacity': [5000, 8000, 4000, 3500],
})
# Create map centered on US
m = folium.Map(location=[39.8, -98.5], zoom_start=4, tiles='cartodbpositron')
# Add warehouse markers
for _, row in warehouses.iterrows():
    folium.Marker(
        location=[row['lat'], row['lon']],
        popup=f"{row['name']}<br>Capacity: {row['capacity']}",
        tooltip=row['name'],
        icon=folium.Icon(color='red', icon='warehouse', prefix='fa'),
    ).add_to(m)
# Add customer clusters (simulated)
n_customers = 500
customer_lats = np.concatenate([
np.random.normal(loc, 2, n_customers // 4)
for loc in [40.7, 34.0, 41.9, 32.8]
])
customer_lons = np.concatenate([
np.random.normal(loc, 2, n_customers // 4)
for loc in [-74.0, -118.2, -87.6, -96.8]
])
# Use MarkerCluster for large point datasets
from folium.plugins import MarkerCluster
marker_cluster = MarkerCluster().add_to(m)
for lat, lon in zip(customer_lats[:200], customer_lons[:200]):  # Subset for speed
    folium.CircleMarker(
        location=[lat, lon],
        radius=3,
        color='blue',
        fill=True,
        fill_opacity=0.5,
    ).add_to(marker_cluster)
m.save('shopsmart_locations.html')
Key Concept --- For datasets with thousands of points, use MarkerCluster from folium.plugins. It groups nearby markers at low zoom levels and expands them as you zoom in. Without clustering, rendering 10,000+ markers will freeze the browser.
Part 5: Engineering Location-Based Features
Location Features for Machine Learning
Raw latitude and longitude are poor features for most ML models. A tree-based model might split on latitude > 40.0, but this creates arbitrary geographic boundaries that do not correspond to real patterns. Better: engineer features that encode meaningful spatial relationships.
The four location feature types you should know:
1. Distance to nearest X
import numpy as np
import pandas as pd
import geopandas as gpd
from shapely.geometry import Point
np.random.seed(42)
# ShopSmart: distance from each customer to nearest warehouse
warehouse_coords = np.array([
[-74.0060, 40.7128], # NYC
[-118.2437, 34.0522], # LA
[-87.6298, 41.8781], # Chicago
[-96.7970, 32.7767], # Dallas
])
# Simulate 1000 customers across the US
n = 1000
customer_coords = np.column_stack([
np.random.uniform(-120, -70, n), # longitude
np.random.uniform(25, 48, n), # latitude
])
# Method 1: Haversine distance (accurate for geographic data)
def haversine_vectorized(lon1, lat1, lon2, lat2):
    """Vectorized haversine distance in km."""
    R = 6371
    lon1, lat1, lon2, lat2 = map(np.radians, [lon1, lat1, lon2, lat2])
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = np.sin(dlat / 2)**2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2)**2
    return R * 2 * np.arctan2(np.sqrt(a), np.sqrt(1 - a))
# Calculate distance to each warehouse, take minimum
distances = np.column_stack([
haversine_vectorized(
customer_coords[:, 0], customer_coords[:, 1],
wh[0], wh[1]
)
for wh in warehouse_coords
])
min_distances = distances.min(axis=1)
nearest_warehouse = distances.argmin(axis=1)
customers = pd.DataFrame({
'lon': customer_coords[:, 0],
'lat': customer_coords[:, 1],
'dist_nearest_warehouse_km': min_distances.round(1),
'nearest_warehouse_idx': nearest_warehouse,
})
print(customers.describe()[['dist_nearest_warehouse_km']])
print(f"\nWarehouse assignment distribution:")
print(customers['nearest_warehouse_idx'].value_counts().sort_index())
2. Count within radius
# How many competitors are within 50 km of each ShopSmart warehouse?
competitor_coords = np.column_stack([
np.random.uniform(-120, -70, 50),
np.random.uniform(25, 48, 50),
])
radius_km = 50
competitor_counts = []
for wh in warehouse_coords:
    dists = haversine_vectorized(
        wh[0], wh[1],
        competitor_coords[:, 0], competitor_coords[:, 1]
    )
    competitor_counts.append((dists <= radius_km).sum())
print("Competitors within 50 km of each warehouse:")
for i, name in enumerate(['NYC', 'LA', 'Chicago', 'Dallas']):
    print(f"  {name}: {competitor_counts[i]}")
3. Region-based aggregates (from spatial join)
# After spatial join: use region as a feature or compute regional stats
# This leverages the spatial join from Part 3
# Example: encode regional churn rate as a feature
# (proxy for local competition, network quality, cultural factors)
state_churn_feature = (
customers_df
.groupby('state')['churned']
.mean()
.rename('state_churn_rate')
)
customers_enriched = customers_df.merge(
state_churn_feature,
left_on='state',
right_index=True,
)
print(customers_enriched[['customer_id', 'state', 'churned', 'state_churn_rate']].head())
Warning --- Using the target variable's regional aggregate as a feature introduces leakage if not handled carefully. Use leave-one-out encoding or compute the aggregate on the training set only. Chapter 6 covered this in the context of target encoding.
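The leave-one-out version can be sketched in a few lines: each row's own target is subtracted from its group total before averaging, so the feature never encodes the row's own label (toy data for illustration):

```python
import pandas as pd

demo = pd.DataFrame({
    'state': ['CA', 'CA', 'CA', 'TX', 'TX'],
    'churned': [1, 0, 1, 1, 1],
})

grp = demo.groupby('state')['churned']
sums = grp.transform('sum')
counts = grp.transform('count')

# Leave-one-out mean: drop the row's own label before averaging.
# Note: groups of size 1 divide by zero; in practice fall back to
# the global mean for singleton groups.
demo['state_churn_loo'] = (sums - demo['churned']) / (counts - 1)
print(demo)
```

For the first CA row, the encoding is (2 - 1) / 2 = 0.5: the mean of the other two CA rows, not of all three.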
4. Buffer analysis
import geopandas as gpd
from shapely.geometry import Point
# Create a buffer around each warehouse (50 km radius)
# Must project to meters first!
warehouses_gdf = gpd.GeoDataFrame(
{'name': ['NYC', 'LA', 'Chicago', 'Dallas']},
geometry=[Point(c) for c in warehouse_coords],
crs='EPSG:4326'
)
# Project to an equal-area projection for accurate buffering
warehouses_projected = warehouses_gdf.to_crs('EPSG:5070') # Albers Equal Area (US)
# Buffer: 50 km = 50,000 meters
warehouses_buffered = warehouses_projected.copy()
warehouses_buffered['geometry'] = warehouses_projected.buffer(50_000)
# Project back to WGS84 for mapping
warehouses_buffered = warehouses_buffered.to_crs('EPSG:4326')
print("Buffer geometries created:")
for _, row in warehouses_buffered.iterrows():
    print(f"  {row['name']}: {row.geometry.geom_type}, "
          f"area = {row.geometry.area:.4f} sq degrees")
Common Mistake --- Calling .buffer(50000) on unprojected data (EPSG:4326). The buffer distance is interpreted as 50,000 degrees, not meters, producing a buffer that covers half the planet. Always project to a meter-based CRS before buffering.
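A cheap sanity check catches this class of bug: in a meter-based CRS, a 50 km buffer should have an area close to pi * r^2. A sketch with bare shapely (the center coordinates are arbitrary projected meters, chosen for illustration):

```python
from math import pi
from shapely.geometry import Point

center = Point(600_000, 4_500_000)  # arbitrary coordinates in a meter-based CRS
buf = center.buffer(50_000)         # 50 km radius

expected = pi * 50_000 ** 2
# buffer() approximates the circle with straight segments, so the
# polygon's area falls slightly short of the exact pi * r^2
assert 0.99 * expected < buf.area <= expected
print(f"Buffer area: {buf.area / 1e6:.0f} sq km")  # ~7,800 sq km
```

If you had buffered in degrees by mistake, the same check against pi * r^2 in meters would fail by many orders of magnitude, which is exactly the kind of loud failure you want.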
Using Spatial Features in a Model
Putting it together: distance-based features in a churn prediction model.
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import LabelEncoder
np.random.seed(42)
# Build feature matrix with spatial features
features_df = customers_df.copy()
# Encode categorical
le = LabelEncoder()
features_df['plan_encoded'] = le.fit_transform(features_df['plan_tier'])
# Add spatial features
features_df = features_df.merge(
state_churn_feature,
left_on='state',
right_index=True,
)
# Simulate distance-to-nearest-datacenter feature
state_distances = {
'California': 50, 'Texas': 800, 'New York': 100, 'Florida': 1200,
'Illinois': 200, 'Pennsylvania': 150, 'Ohio': 300, 'Georgia': 900,
'North Carolina': 500, 'Michigan': 350, 'New Jersey': 80, 'Virginia': 200,
'Washington': 150, 'Arizona': 600, 'Massachusetts': 120, 'Tennessee': 700,
'Indiana': 250, 'Missouri': 500, 'Maryland': 180, 'Colorado': 400,
}
features_df['dist_datacenter_km'] = features_df['state'].map(state_distances)
# Feature set
feature_cols = ['plan_encoded', 'tenure_months', 'monthly_revenue',
'dist_datacenter_km', 'state_churn_rate']
X = features_df[feature_cols]
y = features_df['churned']
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42, stratify=y
)
# Model without spatial features
model_no_geo = GradientBoostingClassifier(
n_estimators=100, max_depth=4, random_state=42
)
scores_no_geo = cross_val_score(
model_no_geo, X_train[['plan_encoded', 'tenure_months', 'monthly_revenue']],
y_train, cv=5, scoring='roc_auc'
)
# Model with spatial features
model_with_geo = GradientBoostingClassifier(
n_estimators=100, max_depth=4, random_state=42
)
scores_with_geo = cross_val_score(
model_with_geo, X_train, y_train, cv=5, scoring='roc_auc'
)
print(f"AUC without spatial features: {scores_no_geo.mean():.4f} +/- {scores_no_geo.std():.4f}")
print(f"AUC with spatial features: {scores_with_geo.mean():.4f} +/- {scores_with_geo.std():.4f}")
print(f"Improvement: {(scores_with_geo.mean() - scores_no_geo.mean()) * 100:.2f} percentage points")
Part 6: Geocoding and Real-World Data Challenges
Geocoding: From Addresses to Coordinates
Geocoding converts street addresses to latitude/longitude coordinates. Reverse geocoding does the opposite.
# Geocoding with geopy (free, rate-limited)
from geopy.geocoders import Nominatim
from geopy.extra.rate_limiter import RateLimiter
# Initialize geocoder with a user agent string
geolocator = Nominatim(user_agent="data_science_textbook")
geocode = RateLimiter(geolocator.geocode, min_delay_seconds=1)
# Geocode a single address
# location = geocode("350 Fifth Avenue, New York, NY")
# print(f"Lat: {location.latitude}, Lon: {location.longitude}")
# Lat: 40.7484, Lon: -73.9857
# For batch geocoding, apply to a DataFrame column
# df['location'] = df['address'].apply(geocode)
# df['lat'] = df['location'].apply(lambda loc: loc.latitude if loc else None)
# df['lon'] = df['location'].apply(lambda loc: loc.longitude if loc else None)
Practical Tip --- Free geocoding services (Nominatim, US Census) have rate limits: 1 request per second for Nominatim. For batch geocoding thousands of addresses, use a paid service (Google Maps, Mapbox, HERE) or the US Census batch geocoder (free, no rate limit, US addresses only, up to 10,000 per batch).
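Whichever service you use, cache results locally so repeated addresses never trigger a second request. A minimal in-memory sketch; fake_geocode is a hypothetical stand-in for a real call such as geolocator.geocode:

```python
def make_cached_geocoder(geocode_fn):
    """Wrap a geocoding function with a simple in-memory cache."""
    cache = {}
    def cached(address):
        if address not in cache:
            cache[address] = geocode_fn(address)  # only a miss hits the service
        return cache[address]
    return cached

calls = []
def fake_geocode(address):  # hypothetical stand-in; returns (lat, lon)
    calls.append(address)
    return (40.7484, -73.9857)

geocode = make_cached_geocoder(fake_geocode)
geocode("350 Fifth Avenue, New York, NY")
geocode("350 Fifth Avenue, New York, NY")  # cache hit: no second request
print(f"Underlying calls made: {len(calls)}")  # 1
```

For batch jobs that may be re-run, persist the cache to disk (a CSV or SQLite table keyed by address) so a crashed run does not re-geocode everything from scratch.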
Common Geospatial Data Problems
Problem 1: Lat/lon are swapped. Your data has customers in the middle of the ocean. Check: latitude should be between -90 and 90, longitude between -180 and 180. If a "latitude" value is -73.98, it is actually a longitude.
# Quick sanity check
import pandas as pd

def validate_coordinates(df, lat_col, lon_col):
    """Flag rows with potentially swapped or invalid coordinates."""
    issues = pd.DataFrame(index=df.index)
    issues['lat_out_of_range'] = ~df[lat_col].between(-90, 90)
    issues['lon_out_of_range'] = ~df[lon_col].between(-180, 180)
    issues['possibly_swapped'] = (
        df[lat_col].between(-180, -90) | df[lat_col].between(90, 180)
    ) & df[lon_col].between(-90, 90)
    n_issues = issues.any(axis=1).sum()
    if n_issues > 0:
        print(f"WARNING: {n_issues} rows with coordinate issues")
        print(issues.sum())
    else:
        print("All coordinates valid")
    return issues
Problem 2: Mixed CRS. One dataset uses EPSG:4326, another uses a local projection. The spatial join silently produces wrong results. Always check .crs on both GeoDataFrames before joining.
Problem 3: Points on polygon boundaries. A point that falls exactly on the boundary between two polygons may be assigned to either one, or to neither. Use predicate='intersects' instead of predicate='within' if boundary cases matter.
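The boundary behavior is easy to demonstrate with two adjacent boxes (toy geometries):

```python
from shapely.geometry import Point, box

left = box(0, 0, 1, 1)
right = box(1, 0, 2, 1)
edge_point = Point(1.0, 0.5)  # exactly on the shared edge

# 'within' requires the point to touch the polygon's interior, so a
# boundary point is within neither box; 'intersects' catches both
print(edge_point.within(left), edge_point.within(right))          # False False
print(edge_point.intersects(left), edge_point.intersects(right))  # True True
```

Note the flip side: with predicate='intersects', a boundary point matches both polygons, so a left spatial join can duplicate that row. Decide which failure mode is acceptable for your analysis.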
Problem 4: Null geometries. Geocoding failures produce null geometries. Filter them before spatial operations.
# Clean null geometries
# gdf_clean = gdf[gdf.geometry.notna() & ~gdf.is_empty]
Part 7: Putting It All Together --- The ShopSmart Delivery Analysis
Let us build a complete geospatial analysis pipeline for ShopSmart's delivery optimization problem.
import numpy as np
import pandas as pd
import geopandas as gpd
from shapely.geometry import Point
from math import radians, sin, cos, sqrt, atan2
np.random.seed(42)
# Haversine distance in km (as defined earlier in the chapter)
def haversine(lon1, lat1, lon2, lat2):
    """Great-circle distance in km between two (lon, lat) points."""
    R = 6371  # Earth radius, km
    lon1, lat1, lon2, lat2 = map(radians, [lon1, lat1, lon2, lat2])
    dlon, dlat = lon2 - lon1, lat2 - lat1
    a = sin(dlat / 2) ** 2 + cos(lat1) * cos(lat2) * sin(dlon / 2) ** 2
    return R * 2 * atan2(sqrt(a), sqrt(1 - a))
# --- 1. Generate delivery data ---
n_orders = 3000
warehouse_info = {
'NYC': {'lon': -74.006, 'lat': 40.713, 'capacity': 5000},
'LA': {'lon': -118.244, 'lat': 34.052, 'capacity': 8000},
'CHI': {'lon': -87.630, 'lat': 41.878, 'capacity': 4000},
'DAL': {'lon': -96.797, 'lat': 32.777, 'capacity': 3500},
}
# Customer locations clustered around metros
metro_centers = [
(-74.0, 40.7, 0.25), (-118.2, 34.0, 0.30),
(-87.6, 41.9, 0.20), (-96.8, 32.8, 0.25),
(-122.4, 37.8, 0.15), (-95.4, 29.8, 0.20),
(-80.2, 25.8, 0.20), (-71.1, 42.4, 0.15),
]
orders_data = []
for i in range(n_orders):
    center = metro_centers[np.random.randint(len(metro_centers))]
    clon = center[0] + np.random.normal(0, center[2])
    clat = center[1] + np.random.normal(0, center[2])
    # Find nearest warehouse
    min_dist = float('inf')
    nearest_wh = None
    for wh_name, wh_info in warehouse_info.items():
        d = haversine(clon, clat, wh_info['lon'], wh_info['lat'])
        if d < min_dist:
            min_dist = d
            nearest_wh = wh_name
    # Delivery time: base + distance effect + noise
    base_days = 1.5
    distance_effect = min_dist / 500  # ~1 day per 500 km
    noise = np.random.exponential(0.5)
    delivery_days = round(base_days + distance_effect + noise, 1)
    # Customer satisfaction: inversely related to delivery time
    sat_base = 5.0 - (delivery_days - 1.5) * 0.4 + np.random.normal(0, 0.5)
    satisfaction = np.clip(round(sat_base, 1), 1.0, 5.0)
    orders_data.append({
        'order_id': i,
        'customer_lon': round(clon, 4),
        'customer_lat': round(clat, 4),
        'nearest_warehouse': nearest_wh,
        'distance_to_warehouse_km': round(min_dist, 1),
        'delivery_days': delivery_days,
        'satisfaction_score': satisfaction,
    })
orders = pd.DataFrame(orders_data)
# --- 2. Convert to GeoDataFrame ---
orders_gdf = gpd.GeoDataFrame(
orders,
geometry=gpd.points_from_xy(orders['customer_lon'], orders['customer_lat']),
crs='EPSG:4326'
)
# --- 3. Analyze delivery performance by warehouse ---
warehouse_performance = (
orders
.groupby('nearest_warehouse')
.agg(
n_orders=('order_id', 'count'),
avg_distance_km=('distance_to_warehouse_km', 'mean'),
avg_delivery_days=('delivery_days', 'mean'),
avg_satisfaction=('satisfaction_score', 'mean'),
pct_over_5_days=('delivery_days', lambda x: (x > 5).mean()),
)
.round(3)
)
print("Delivery Performance by Warehouse:")
print(warehouse_performance)
# --- 4. Distance bins as features ---
orders['distance_bin'] = pd.cut(
orders['distance_to_warehouse_km'],
bins=[0, 100, 300, 500, 1000, 5000],
labels=['<100km', '100-300km', '300-500km', '500-1000km', '>1000km']
)
distance_analysis = (
orders
.groupby('distance_bin', observed=True)
.agg(
n_orders=('order_id', 'count'),
avg_delivery_days=('delivery_days', 'mean'),
avg_satisfaction=('satisfaction_score', 'mean'),
)
.round(3)
)
print("\nDelivery Performance by Distance:")
print(distance_analysis)
The relationship is clear: delivery distance drives delivery time, and delivery time drives satisfaction. This is the feature engineering insight: distance_to_nearest_warehouse is a stronger predictor of customer satisfaction than raw latitude or longitude.
Domain Knowledge --- This is Theme 4 in action. A model given raw lat/lon coordinates might learn a weak spatial signal. A model given distance_to_nearest_warehouse_km learns the actual causal mechanism. The geospatial feature engineering step --- converting coordinates to meaningful distances --- is where domain knowledge meets data science.
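The conversion from coordinates to distances does not require a row-by-row loop. A vectorized sketch (NumPy haversine with broadcasting; two sample customers, and the four warehouse coordinates from the pipeline) computes the distance-to-nearest-warehouse feature for all customers at once:

```python
import numpy as np

def haversine_matrix(lon1, lat1, lon2, lat2):
    """Pairwise great-circle distances in km via broadcasting."""
    lon1, lat1, lon2, lat2 = map(np.radians, (lon1, lat1, lon2, lat2))
    dlon = lon2 - lon1[:, None]
    dlat = lat2 - lat1[:, None]
    a = (np.sin(dlat / 2) ** 2
         + np.cos(lat1[:, None]) * np.cos(lat2) * np.sin(dlon / 2) ** 2)
    return 6371 * 2 * np.arctan2(np.sqrt(a), np.sqrt(1 - a))

cust_lon = np.array([-74.0, -118.2])   # two sample customers (NYC, LA area)
cust_lat = np.array([40.7, 34.0])
wh_lon = np.array([-74.006, -118.244, -87.630, -96.797])  # NYC, LA, CHI, DAL
wh_lat = np.array([40.713, 34.052, 41.878, 32.777])

dists = haversine_matrix(cust_lon, cust_lat, wh_lon, wh_lat)  # shape (2, 4)
dist_to_nearest = dists.min(axis=1)   # the ML-ready feature
nearest_idx = dists.argmin(axis=1)    # which warehouse serves each customer
```

For a few thousand customers and a handful of warehouses, the full distance matrix fits in memory easily; at much larger scale, a spatial index (e.g. a BallTree with the haversine metric) avoids materializing all pairs.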
Chapter Summary
This chapter covered the geospatial toolkit that every data scientist needs:
- GeoDataFrames extend pandas with a geometry column. The API is identical to pandas plus spatial methods. Use gpd.points_from_xy() for point data, gpd.read_file() for shapefiles and GeoJSON.
- CRS and projections determine whether your measurements are in degrees or meters. Store data in EPSG:4326 (WGS84). Project to a local CRS before measuring distances or areas. The haversine formula is the alternative for point-to-point distances without projection.
- Spatial joins (gpd.sjoin) connect datasets by geography rather than keys. Point-in-polygon is the most common operation: "Which region does this customer fall in?"
- Folium creates interactive choropleth and point maps. Geopandas .plot() creates static maps for exploration and reports.
- Location-based feature engineering converts raw coordinates into ML-ready features: distance to nearest facility, count within radius, regional aggregates, buffer zones.
- The common mistakes are all preventable: swapped lat/lon, CRS mismatch, buffering in degrees, and spatial joins that silently drop rows.
The case studies that follow apply these tools to StreamFlow's regional churn analysis and ShopSmart's delivery optimization. Both demonstrate the same principle: location is not just a column in your data. It is a dimension of your analysis.
Next: Case Study 1 --- StreamFlow Regional Churn Choropleth | Exercises | Quiz