Key Takeaways: Chapter 27

Working with Geospatial Data: Maps, Spatial Joins, and Location-Based Analysis


  1. A GeoDataFrame is a pandas DataFrame with a geometry column. Everything you know about pandas --- filtering, grouping, merging, aggregation --- works identically. The geometry column adds spatial methods: .distance(), .buffer(), .contains(), .intersects(). If you can write pandas code, you can write geopandas code. The learning curve is the spatial concepts, not the API.

  2. Longitude comes first, latitude second. Shapely's Point(lon, lat) follows the mathematical (x, y) convention, not the human "lat, lon" convention. Getting this backwards silently moves your data to the wrong place, often another continent or the open ocean, and geopandas will not raise an error. This is the single most common geospatial bug. Verify every new dataset by plotting it and checking that points appear where you expect.

  3. Store in EPSG:4326, project before measuring. GPS coordinates, web APIs, and geocoding services use EPSG:4326 (WGS84), which expresses positions as latitude and longitude in degrees. For storage and exchange, this is the standard. But degrees are not meters: .distance() on unprojected data returns degrees, not kilometers, and the conversion factor varies by latitude. Before calculating distances, areas, or buffers, reproject to a local projected CRS (UTM for regional analysis, Albers Equal Area for national analysis). For point-to-point distances alone, the haversine formula is an alternative that works directly on degrees.

  4. Spatial joins replace key-based joins when the relationship is geographic. gpd.sjoin(points, polygons, predicate='within') assigns each point to the polygon it falls inside. This is the geospatial equivalent of a SQL JOIN --- but the join key is location, not a shared column. The spatial index (R-tree) makes this fast even for hundreds of thousands of points. Always check that both GeoDataFrames share the same CRS before joining.

  5. Choropleth maps communicate spatial patterns faster than tables. A sorted table of churn rates by state takes thirty seconds to parse. A color-coded map communicates "the Southeast has a problem" in two seconds. Use folium for interactive maps (stakeholder presentations, dashboards) and geopandas .plot() for static maps (reports, notebooks). Choose sequential color scales for unipolar data (low to high) and diverging scales for bipolar data (values below and above a meaningful midpoint).

  6. Raw latitude and longitude are poor ML features. A tree-based model splitting on latitude > 40.0 creates an arbitrary geographic boundary. Better features encode meaningful spatial relationships: distance to nearest facility, count of competitors within a radius, regional aggregate statistics, buffer zone membership. These features are interpretable, generalizable, and often capture the actual causal mechanism (delivery time depends on distance, not on raw coordinates).

  7. The haversine formula is the data scientist's geodesic distance tool. It computes great-circle distance between two points on a sphere, given latitude and longitude. It assumes a spherical Earth (introducing up to 0.5% error vs. the Vincenty formula on an ellipsoid), which is accurate enough for virtually all data science applications. Use it when you need point-to-point distances without reprojecting your entire dataset.

  8. Buffer analysis requires a projected CRS. Calling .buffer(50000) on EPSG:4326 data creates a buffer of 50,000 degrees, not meters, producing a shape that extends far beyond the valid coordinate range of the planet. Geopandas will not stop you: recent versions emit a UserWarning about the geographic CRS but still return the result. The fix: reproject to a meter-based CRS, buffer, then reproject back. This applies to any operation whose argument is in distance units: buffer, simplify, or any custom function that uses .distance().

  9. Geospatial features carry leakage risk when derived from the target. A state_churn_rate feature computed from the full dataset leaks the test set's churn information into the training features. Use leave-one-out encoding within cross-validation folds, or compute regional aggregates on the training set only. This is the same target-encoding leakage problem from Chapter 6, applied to spatial aggregates.

  10. Location is a dimension of analysis, not just a column in the data. The StreamFlow case study showed that churn varies geographically even after controlling for plan, tenure, and revenue. The ShopSmart case study showed that delivery distance is the primary driver of customer satisfaction and that the optimal location for a new fulfillment center is a spatial optimization problem. In both cases, the geospatial analysis identified specific, actionable insights that a non-spatial analysis would have missed entirely.
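
As a minimal sketch of takeaways 6 and 7, the haversine formula can be written with the standard library alone and then reused to build a distance-to-nearest-facility feature. The function names, the list of centers, and the mean Earth radius of 6371.0088 km are assumptions chosen for illustration, not code from the chapter.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # assumed mean radius for the spherical-Earth model


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in decimal degrees.

    Note the (lat, lon) argument order here, unlike shapely's Point(lon, lat).
    """
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2)
    # Clamp to 1.0 to guard against floating-point overshoot near antipodes.
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(min(1.0, a)))


def distance_to_nearest_km(lat, lon, facilities):
    """Feature: distance from one point to the closest (lat, lon) facility."""
    return min(haversine_km(lat, lon, f_lat, f_lon) for f_lat, f_lon in facilities)


# One degree of longitude is not a fixed length: it shrinks toward the poles.
deg_at_equator = haversine_km(0.0, 0.0, 0.0, 1.0)
deg_at_60n = haversine_km(60.0, 0.0, 60.0, 1.0)

# Hypothetical fulfillment centers as (lat, lon) pairs.
centers = [(40.7128, -74.0060), (34.0522, -118.2437)]
nearest_km = distance_to_nearest_km(41.8781, -87.6298, centers)

print(f"1 degree of longitude at the equator: {deg_at_equator:.1f} km")
print(f"1 degree of longitude at 60N: {deg_at_60n:.1f} km")
print(f"Distance to nearest center: {nearest_km:.0f} km")
```

The two one-degree measurements show why a single degrees-to-kilometers factor does not exist: the same one-degree step in longitude covers roughly half the distance at 60 degrees north that it covers at the equator.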
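
Takeaway 9 can be illustrated with a small sketch that computes a regional churn rate from training rows only, applies leave-one-out encoding to the training rows themselves, and falls back to the global training rate for regions unseen in training. The toy rows, region codes, and helper names are hypothetical, and plain Python is used so the example runs without dependencies.

```python
from collections import defaultdict

# Hypothetical (region, churned) rows; only training rows may inform the feature.
train = [("GA", 1), ("GA", 0), ("GA", 1), ("NY", 0), ("NY", 0)]
test = [("GA", 0), ("NY", 1), ("TX", 1)]

# Per-region label sums and row counts, built from the training set alone.
sums, counts = defaultdict(int), defaultdict(int)
for region, churned in train:
    sums[region] += churned
    counts[region] += 1

global_rate = sum(churned for _, churned in train) / len(train)


def loo_rate(region, churned):
    """Leave-one-out rate for a training row: exclude the row's own label."""
    n_others = counts[region] - 1
    if n_others == 0:
        return global_rate  # a lone row has no other rows to learn from
    return (sums[region] - churned) / n_others


def train_only_rate(region):
    """Rate for a test row: training rows only, global rate if region is unseen."""
    if counts.get(region, 0) == 0:
        return global_rate
    return sums[region] / counts[region]


train_feature = [loo_rate(region, churned) for region, churned in train]
test_feature = [train_only_rate(region) for region, _ in test]

print(train_feature)
print(test_feature)
```

Test labels never enter `sums` or `counts`, so the test feature carries no information about the test set's churn. Inside cross-validation, the same computation would be repeated per fold, with each fold's held-out rows playing the role of `test`.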


If You Remember One Thing

Do not treat latitude and longitude as ordinary numerical features. They are coordinates in a reference system, and working with them correctly requires projections for measurement, spatial joins for aggregation, and domain-aware feature engineering for modeling. The tools --- geopandas, shapely, folium, haversine --- are straightforward. The discipline is in always checking your CRS, always projecting before measuring, and always engineering features that encode meaningful spatial relationships rather than feeding raw coordinates into a model.
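
The discipline described above might look like the following sketch, assuming geopandas 0.10 or later (for the predicate argument of sjoin) with shapely and pyproj installed. The customer points, the rectangular region, the store location, and the choice of UTM zone 18N (EPSG:32618) are hypothetical values picked for illustration.

```python
import geopandas as gpd
from shapely.geometry import Point, box

# Hypothetical customers; Point takes (lon, lat), i.e. x first, then y.
customers = gpd.GeoDataFrame(
    {"customer_id": [1, 2]},
    geometry=[Point(-73.9857, 40.7484), Point(-0.1276, 51.5072)],
    crs="EPSG:4326",
)

# Hypothetical region polygon: a lon/lat bounding box around New York City.
regions = gpd.GeoDataFrame(
    {"region": ["nyc_area"]},
    geometry=[box(-75.0, 40.0, -73.0, 41.5)],
    crs="EPSG:4326",
)

# Always check the CRS before joining; reproject if the two differ.
if customers.crs != regions.crs:
    regions = regions.to_crs(customers.crs)

# Spatial join: each customer gets the region it falls within (NaN if none).
joined = gpd.sjoin(customers, regions, how="left", predicate="within")

# Project before measuring: buffer in meters, then reproject back to degrees.
store = gpd.GeoSeries([Point(-74.0060, 40.7128)], crs="EPSG:4326")
store_m = store.to_crs(epsg=32618)      # UTM zone 18N, units are meters
buffer_m = store_m.buffer(10_000)       # 10 km radius
buffer_deg = buffer_m.to_crs(epsg=4326)

# Buffer-zone membership as a model feature.
customers["within_10km"] = customers.geometry.within(buffer_deg.iloc[0])

print(joined[["customer_id", "region"]])
print(customers[["customer_id", "within_10km"]])
```

The buffer is built in the projected CRS and only then converted back, so the 10,000 passed to .buffer() is interpreted as meters rather than degrees.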


These takeaways summarize Chapter 27: Working with Geospatial Data. Return to the chapter for full context.